Incremental commit groups for non-atomic trace processing

We introduce techniques to support efficient non-atomic execution of very long traces on a new binary-translation-based, x86-64 compatible VLIW microprocessor. Incrementally committed long traces significantly reduce the computation wasted on exception-induced rollbacks by retaining the correctly committed parts of traces. We divide each scheduled trace into multiple commit groups; a group is committed to the architectural state once all instructions within and prior to that group complete without exceptions. Architectural state updates that should become visible only after future commit points are deferred using a simple hardware commit buffer. We employ a commit depth predictor to predict how many groups a trace will complete, thereby eliminating pipeline flushes on repeated rollbacks. Unlike atomic traces, we allow instructions to be freely scheduled across commit points throughout the trace to maximize ILP. Commit groups are formed after scheduling, allowing the commit points terminating each group to be inserted more optimally. Commit groups promote significantly faster convergence on optimized traces, since we salvage partially executed traces and splice the working parts together into new optimized traces. We use detailed models to demonstrate how commit groups substantially improve performance (on average, over 1.5× on SPEC 2000) relative to atomic traces.
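
The commit-group idea can be illustrated with a minimal software sketch. This is not the hardware design described here, only a toy model under stated assumptions: `TraceFault`, `run_trace`, the register names, and the representation of a group as a list of (register, function) pairs are all hypothetical, and instructions are kept inside their own group for simplicity, whereas the actual scheme lets them be scheduled across commit points. Updates accumulate in a per-group buffer and reach the architectural state only at the group's commit point, so a fault discards the current group while earlier groups stay committed.

```python
class TraceFault(Exception):
    """Hypothetical stand-in for an exception raised by a trace instruction."""


def run_trace(groups, arch_state):
    """Execute commit groups in order and return how many groups committed.

    groups: list of groups; each group is a list of (register, function)
    pairs, where the function maps the current speculative state to a value.
    arch_state: dict of architectural registers, updated only at commit points.
    """
    committed = 0
    for group in groups:
        commit_buffer = {}  # deferred updates, invisible until the commit point
        try:
            for reg, fn in group:
                # Instructions see committed state plus this group's pending updates.
                view = {**arch_state, **commit_buffer}
                commit_buffer[reg] = fn(view)
        except TraceFault:
            # Roll back only the faulting group; earlier groups are retained.
            return committed
        arch_state.update(commit_buffer)  # commit point: updates become visible
        committed += 1
    return committed


def faulting_op(state):
    raise TraceFault()


# Three commit groups; the second group faults after computing r2.
state = {"r0": 1}
trace = [
    [("r1", lambda s: s["r0"] + 1)],
    [("r2", lambda s: s["r1"] * 2), ("r3", faulting_op)],
    [("r4", lambda s: s["r2"] + 1)],
]
depth = run_trace(trace, state)
```

In this sketch the first group's result survives the fault while the pending `r2` update is dropped with the buffer. An atomic trace would instead discard all three groups' work. The returned depth is the kind of value a commit depth predictor would try to anticipate on the next execution of the same trace.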
