Paper information - Continual flow pipelines: achieving resource-efficient latency tolerance

With the natural trend toward integration, microprocessors increasingly support multiple cores on a single chip. To keep design effort and costs down, designers of these multicore microprocessors frequently target an entire product range, from mobile laptops to high-end servers. This article discusses a continual flow pipeline (CFP) processor. Such a processor architecture can sustain a large number of in-flight instructions (commonly referred to as the instruction window, comprising all instructions that have been renamed but not yet retired) without requiring the cycle-critical structures to scale up. By keeping these structures small and making the processor core tolerant of memory latencies, a CFP mechanism enables the new core to achieve high single-thread performance, and many of these cores can be placed on a chip for high throughput. The resulting large instruction window exposes substantial instruction-level parallelism and achieves memory latency tolerance, while the small size of the cycle-critical resources permits a high clock frequency.
