Continual flow pipelines: achieving resource-efficient latency tolerance

With the natural trend toward integration, microprocessors increasingly support multiple cores on a single chip. To keep design effort and costs down, designers of these multicore microprocessors frequently target an entire product range, from mobile laptops to high-end servers. This article discusses a continual flow pipeline (CFP) processor. Such a processor architecture can sustain a large number of in-flight instructions (commonly referred to as the instruction window, comprising all instructions that have been renamed but not yet retired) without requiring the cycle-critical structures to scale up. By keeping these structures small and making the processor core tolerant of memory latencies, a CFP mechanism enables the new core to achieve high single-thread performance, and many of these cores can be placed on a chip for high throughput. The resulting large instruction window reveals substantial instruction-level parallelism and achieves memory latency tolerance, while the small size of cycle-critical resources permits a high clock frequency.
