Hiding the misprediction penalty of a resource-efficient high-performance processor

Misprediction is a major obstacle to increasing the performance of speculative out-of-order processors. Performance degradation depends on both the number of misprediction events and the recovery time associated with each of them. In recent years, a few checkpoint-based microarchitectures have been proposed. Compared with ROB-based processors, checkpoint processors are scalable and highly resource-efficient. Unfortunately, in these proposals the misprediction recovery time is proportional to the instruction queue size. In this paper we analyze methods to reduce the misprediction recovery time. We propose a new register file management scheme, techniques to selectively flush the instruction queue and the load-store queue, and a technique to isolate deeply pipelined execution units. The result is a novel checkpoint processor with Constant misprediction RollBack time (CRB). We further present a streamlined, cost-efficient solution that reduces complexity at the price of slightly lower performance.