CPROB: Checkpoint Processing with Opportunistic Minimal Recovery

CPR (Checkpoint Processing and Recovery) is a physical register management scheme that supports a larger instruction window and higher average IPC than conventional ROB-style register management. It does so by restricting mis-speculation recovery to checkpoints created at rename, and by leveraging this restriction to aggressively reclaim registers that do not appear in any checkpoint. The cost of CPR is checkpoint overhead, which is incurred when a mis-speculation occurs on an instruction for which no checkpoint was created in advance. In that case, CPR must recover to the immediately older checkpoint, squashing some instructions that are older than the mis-speculation itself. In contrast, a ROB processor performs minimal recovery and squashes only instructions younger than the mis-speculation. CPROB is a hybrid register management scheme that preserves CPR's aggressive reclamation while opportunistically minimizing checkpoint overhead. CPROB extends CPR to track and hold the registers needed to perform minimal recovery to un-executed branches within each checkpoint. Recovery registers are held on a best-effort basis only. A checkpoint's recovery registers can be freed spontaneously when all branches in the checkpoint have executed. They can also be aggressively victimized if dispatch needs registers to proceed. CPROB thus naturally adapts the register reclamation policy to dynamic branch behavior. When branch mis-predictions are infrequent and registers are needed to support a large window, CPROB victimizes registers and behaves like CPR. When mis-predictions are frequent and the window is small, CPROB holds on to registers and behaves like a ROB. As a result, it out-performs both CPR and ROB for a given program. This performance improvement, combined with reduced checkpoint overhead, makes CPROB more energy-efficient than either ROB or CPR.
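The best-effort policy described above can be illustrated with a small software model. This is a minimal sketch and not the hardware design: the class and method names (`Checkpoint`, `RegisterPool`, `recovery_mode`) are hypothetical, registers are plain integers, and victimizing the oldest checkpoint first is an assumption made only for illustration.

```python
class Checkpoint:
    """One rename-time checkpoint plus its best-effort recovery registers."""

    def __init__(self, pending_branches, recovery_regs):
        self.pending_branches = pending_branches   # un-executed branches
        self.recovery_regs = list(recovery_regs)   # held for minimal recovery
        self.victimized = False                    # True once dispatch took them


class RegisterPool:
    """Toy free list that can reclaim recovery registers on demand."""

    def __init__(self, free_regs):
        self.free = list(free_regs)
        self.checkpoints = []                      # oldest first

    def open_checkpoint(self, ckpt):
        self.checkpoints.append(ckpt)

    def _release(self, ckpt):
        # Return all of this checkpoint's recovery registers to the free list.
        self.free.extend(ckpt.recovery_regs)
        ckpt.recovery_regs.clear()

    def branch_executed(self, ckpt):
        # Spontaneous free: once every branch in the checkpoint has executed,
        # minimal recovery into it can no longer be needed.
        ckpt.pending_branches -= 1
        if ckpt.pending_branches == 0:
            self._release(ckpt)

    def allocate(self):
        # Dispatch needs a register. If none is free, victimize a checkpoint
        # that still holds recovery registers (assumed policy: oldest first).
        if not self.free:
            victim = next((c for c in self.checkpoints if c.recovery_regs), None)
            if victim is None:
                return None                        # dispatch would stall
            victim.victimized = True
            self._release(victim)
        return self.free.pop()

    def recovery_mode(self, ckpt):
        # A mis-predicted branch recovers minimally (ROB-like) only if its
        # checkpoint still holds its recovery registers; otherwise it falls
        # back to the older checkpoint (CPR-like).
        return "checkpoint" if ckpt.victimized else "minimal"


pool = RegisterPool(free_regs=[1])
ckpt = Checkpoint(pending_branches=2, recovery_regs=[7, 8])
pool.open_checkpoint(ckpt)
pool.allocate()                  # uses the one free register
pool.allocate()                  # free list empty, so ckpt is victimized
print(pool.recovery_mode(ckpt))
```

In this model, a workload that rarely exhausts the free list keeps `recovery_mode` at `"minimal"`, while one under register pressure drifts toward `"checkpoint"`, which mirrors the adaptive behavior the abstract describes.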
