rePLay: A Hardware Framework for Dynamic Optimization

In this paper, we propose a new processor framework that supports dynamic optimization. The rePLay Framework embeds an optimization engine atop a high-performance execution engine. The heart of the rePLay Framework is the concept of a frame. Frames are large, single-entry, single-exit optimization regions spanning many basic blocks in the program's dynamic instruction stream, yet containing only a single flow of control. This atomic property of frames increases the flexibility in applying optimizations. To support frames, rePLay includes a hardware-based recovery mechanism that rolls back the architectural state to the beginning of a frame if, for example, an early exit condition is detected. This mechanism permits the optimizer to make speculative, aggressive optimizations upon frames. In this paper, we investigate some of the underlying phenomena that support rePLay. Primarily, we evaluate rePLay's region formation strategy. A rePLay configuration with a 256-entry frame cache, using a 74KB frame constructor and frame sequencer, achieves an average frame size of 88 Alpha AXP instructions with 68 percent coverage of the dynamic instruction stream, an average frame completion rate of 92.81 percent, and a frame predictor accuracy of 81.26 percent. These results soundly demonstrate that the frames upon which the optimizations are performed are large and stable. Using the most frequently initiated frames from rePLay executions as samples, we also highlight possible strategies for the rePLay optimization engine. Coupled with the high coverage of frames achieved through dynamic frame construction, the success of these optimizations demonstrates the significance of the rePLay Framework. We believe that the concept of frames, along with the mechanisms and strategies outlined in this paper, will play an important role in future processor architectures.
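To make the frame abstraction concrete, the sketch below models in Python how a frame constructor might chain retired basic blocks into a single-entry, single-exit frame by promoting highly biased branches into assertions, and how a failed assertion rolls execution back to the frame's entry point. This is a minimal software model, not the paper's hardware design: the class names, bias threshold, frame-length cap, and trace format are all illustrative assumptions.

```python
# Illustrative sketch only: a software model of rePLay-style frame
# construction and atomic frame execution. Thresholds, names, and the
# trace format are assumptions, not the hardware described in the paper.
from collections import defaultdict

BIAS_THRESHOLD = 32    # assumed streak length before a branch counts as biased
MAX_FRAME_BLOCKS = 64  # assumed cap on frame length, in basic blocks

class BiasTable:
    """Tracks per-branch direction streaks, as a bias detector might."""
    def __init__(self, threshold=BIAS_THRESHOLD):
        self.threshold = threshold
        self.streak = defaultdict(int)  # branch PC -> +taken / -not-taken streak

    def update(self, pc, taken):
        s = self.streak[pc]
        if taken:
            self.streak[pc] = s + 1 if s >= 0 else 1
        else:
            self.streak[pc] = s - 1 if s <= 0 else -1

    def is_biased(self, pc):
        return abs(self.streak[pc]) >= self.threshold

def build_frames(trace, bias):
    """Group a dynamic stream of (block_id, branch_pc, taken) tuples into
    frames. A biased branch is promoted to an assertion and the frame keeps
    growing; an unbiased branch terminates the frame (its single exit)."""
    frame = []
    for block_id, branch_pc, taken in trace:
        bias.update(branch_pc, taken)
        frame.append((block_id, branch_pc, taken))
        if not bias.is_biased(branch_pc) or len(frame) >= MAX_FRAME_BLOCKS:
            if len(frame) > 1:          # single blocks are not worth framing
                yield tuple(frame)
            frame = []

def execute_frame(frame, outcomes, checkpoint):
    """Atomic execution: if any promoted assertion fails, all of the frame's
    work is discarded and the checkpoint taken at frame entry is restored."""
    state = dict(checkpoint)            # work on a copy; commit only on success
    for block_id, branch_pc, expected in frame:
        state[block_id] = state.get(block_id, 0) + 1   # stand-in for real work
        if outcomes.get(branch_pc, expected) != expected:
            return dict(checkpoint), False  # assertion fired: roll back
    return state, True                      # all assertions held: commit frame
```

In this model, the frame completion rate reported above corresponds to the fraction of `execute_frame` calls that return `True`, and coverage to the share of dynamic instructions that fall inside yielded frames.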
