Scalable selective re-execution for EDGE architectures

Pipeline flushes are becoming increasingly expensive in modern microprocessors with large instruction windows and deep pipelines. Selective re-execution is a technique that can reduce the penalty of mis-speculation by re-executing only the instructions affected by a mis-speculation, rather than all in-flight instructions. In this paper we introduce a new selective re-execution mechanism that exploits the properties of a dataflow-like Explicit Data Graph Execution (EDGE) architecture to support efficient mis-speculation recovery while scaling to window sizes of thousands of instructions with high performance. This distributed selective re-execution (DSRE) protocol permits multiple speculative waves of computation to traverse a dataflow graph simultaneously, with a commit wave propagating behind them to ensure correct execution. We evaluate one application of this protocol: providing efficient recovery for load-store dependence speculation. Unlike traditional dataflow architectures, which relied on single-assignment memory semantics, the DSRE protocol combines dataflow execution with speculation to enable both high performance and conventional sequential memory semantics. Our experiments show that the DSRE protocol yields an average 17% speedup over the best dependence predictor proposed to date, and obtains 82% of the performance possible with a perfect oracle directing the issue of loads.
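The core idea behind selective re-execution can be sketched abstractly: when one instruction mis-speculates, only its transitive consumers in the dataflow graph need to be re-executed, while independent instructions are untouched. The following minimal Python sketch illustrates that set computation; the graph, instruction names, and `affected_set` function are illustrative assumptions, not the paper's actual DSRE protocol.

```python
from collections import deque

def affected_set(consumers, misspeculated):
    """Return the set of instructions reachable from the
    mis-speculated one via dataflow (producer -> consumer) edges.
    Only these instructions need re-execution; the rest of the
    window is unaffected. (Illustrative sketch, not the DSRE
    hardware protocol itself.)"""
    seen = {misspeculated}
    work = deque([misspeculated])
    while work:
        node = work.popleft()
        for succ in consumers.get(node, ()):
            if succ not in seen:
                seen.add(succ)
                work.append(succ)
    return seen

# Hypothetical dataflow graph: a load feeds an add, which feeds a
# store; an independent multiply shares no dataflow edges with them.
consumers = {"ld": ["add"], "add": ["st"], "mul": []}

# If the load mis-speculates, only ld, add, and st are re-executed;
# the independent mul is not.
print(sorted(affected_set(consumers, "ld")))  # ['add', 'ld', 'st']
```

In a flush-based recovery scheme, the `mul` would be squashed and re-fetched along with everything younger than the load; selective re-execution leaves it alone, which is what makes the technique attractive for large instruction windows.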
