A system level perspective on branch architecture performance

Current compilers for VLIW and superscalar machines increase the instruction level parallelism of an application by merging several basic blocks into an enlarged predicated block, resulting in higher performance code but increased register requirements. We present a framework that computes precisely the interferences among virtual registers in the presence of predicated operations. Graph-coloring biased register allocators can directly use the resulting interference graph. For interval-graph based register allocators, and others, we propose a technique that reduces the register requirements by allowing non-interfering virtual registers that overlap in time to share a common virtual register. Preliminary measurements on a benchmark of loops from the Perfect Club, SPEC-89, and the Livermore Fortran Kernels indicate the effectiveness of this technique.

[1]  Yale N. Patt,et al.  On tuning the microarchitecture of an HPS implementation of the VAX , 1987, MICRO 20.

[2]  Yale N. Patt,et al.  HPS, a new microarchitecture: rationale and introduction , 1985, MICRO 18.

[3]  Scott Mahlke,et al.  Effective compiler support for predicated execution using the hyperblock , 1992, MICRO 1992.

[4]  Yale N. Patt,et al.  Critical issues regarding HPS, a high performance microarchitecture , 1985, MICRO 18.

[5]  Arthur B. Maccabe,et al.  The program dependence web: a representation supporting control-, data-, and demand-driven interpretation of imperative languages , 1990, PLDI '90.

[6]  ScalesHunter,et al.  Single instruction stream parallelism is greater than two , 1991 .

[7]  Dirk Grunwald,et al.  Next cache line and set prediction , 1995, Proceedings 22nd Annual International Symposium on Computer Architecture.

[8]  Michael Shebanow,et al.  Single instruction stream parallelism is greater than two , 1991, ISCA '91.

[9]  B. Ramakrishna Rau,et al.  Efficient code generation for horizontal architectures: Compiler techniques and architectural support , 1982, ISCA '82.

[10]  Ravi Nair,et al.  Optimal 2-Bit Branch Predictors , 1995, IEEE Trans. Computers.

[11]  Shlomit S. Pinter,et al.  Register allocation with instruction scheduling: a new approach , 1996, Journal of Programming Languages.

[12]  S. McFarling Combining Branch Predictors , 1993 .

[13]  Edward S. Davidson,et al.  Register requirements of pipelined processors , 1992, ICS '92.

[14]  B. Ramakrishna Rau,et al.  Iterative modulo scheduling: an algorithm for software pipelining loops , 1994, MICRO 27.

[15]  Yale N. Patt,et al.  A comprehensive instruction fetch mechanism for a processor supporting speculative execution , 1992, MICRO 1992.

[16]  Ken Kennedy,et al.  Conversion of control dependence to data dependence , 1983, POPL '83.

[17]  Alan Jay Smith,et al.  Branch Prediction Strategies and Branch Target Buffer Design , 1995, Computer.

[18]  Shlomit S. Pinter,et al.  Register allocation with instruction scheduling , 1993, PLDI '93.

[19]  Alexandre E. Eichenberger,et al.  Stage scheduling: a technique to reduce the register requirements of a modulo schedule , 1995, Proceedings of the 28th Annual International Symposium on Microarchitecture.

[20]  D. Grunwald,et al.  Fast & Accurate Instruction Fetch and Branch Prediction , 1994 .

[21]  Alan Eustace,et al.  ATOM - A System for Building Customized Program Analysis Tools , 1994, PLDI.

[22]  Yale N. Patt,et al.  Run-time generation of HPS microinstructions from a VAX instruction stream , 1986, MICRO 19.

[23]  Alexandre E. Eichenberger,et al.  Stage scheduling: a technique to reduce the register requirements of a module schedule , 1995, MICRO 1995.

[24]  Manoj Franklin,et al.  A fill-unit approach to multiple instruction issue , 1994, Proceedings of MICRO-27. The 27th Annual IEEE/ACM International Symposium on Microarchitecture.

[25]  Edward S. Davidson,et al.  Highly concurrent scalar processing , 1986, ISCA 1986.

[26]  S. McFarling,et al.  Reducing the cost of branches , 1986, ISCA '86.

[27]  Chris H. Perleberg,et al.  Branch Target Buffer Design and Optimization , 1993, IEEE Trans. Computers.

[28]  Dirk Grunwald,et al.  Fast and accurate instruction fetch and branch prediction , 1994, ISCA '94.

[29]  Guang R. Gao,et al.  A Register Allocation Framework Based on Hierarchical Cyclic Interval Graphs , 1992, CC.

[30]  Richard E. Kessler,et al.  Page placement algorithms for large real-indexed caches , 1992, TOCS.

[31]  S. Peter Song,et al.  The PowerPC 604 RISC microprocessor. , 1994, IEEE Micro.

[32]  Yale N. Patt,et al.  Alternative Implementations of Two-Level Adaptive Branch Prediction , 1992, [1992] Proceedings the 19th Annual International Symposium on Computer Architecture.

[33]  Alexandre E. Eichenberger,et al.  Minimum register requirements for a modulo schedule , 1994, MICRO 27.

[34]  B. Ramakrishna Rau,et al.  Register allocation for software pipelined loops , 1992, PLDI '92.

[35]  Gregory J. Chaitin,et al.  Register allocation and spilling via graph coloring , 2004, SIGP.

[36]  Brian N. Bershad,et al.  Avoiding conflict misses dynamically in large direct-mapped caches , 1994, ASPLOS VI.

[37]  Grant E. Haab,et al.  Enhanced Modulo Scheduling For Loops With Conditional Branches , 1992, [1992] Proceedings the 25th Annual International Symposium on Microarchitecture MICRO 25.

[38]  Alexandre E. Eichenberger,et al.  Optimum modulo schedules for minimum register requirements , 1995 .

[39]  S. L. Zelen Rationale and Introduction , 1987 .

[40]  Alexander Aiken,et al.  A Development Environment for Horizontal Microcode , 1986, IEEE Trans. Software Eng..

[41]  Joseph T. Rahmeh,et al.  Improving the accuracy of dynamic branch prediction using branch correlation , 1992, ASPLOS V.

[42]  Dirk Grunwald,et al.  Reducing branch costs via branch alignment , 1994, ASPLOS VI.

[43]  Yale N. Patt,et al.  Hardware Support For Large Atomic Units in Dynamically Scheduled Machines , 1988, [1988] Proceedings of the 21st Annual Workshop on Microprogramming and Microarchitecture - MICRO '21.

[44]  Alfred V. Aho,et al.  Compilers: Principles, Techniques, and Tools , 1986, Addison-Wesley series in computer science / World student series edition.

[45]  M. Schlansker,et al.  On Predicated Execution , 1991 .

[46]  Richard A. Huff,et al.  Lifetime-sensitive modulo scheduling , 1993, PLDI '93.

[47]  Yale N. Patt,et al.  A comparison of dynamic branch predictors that use two levels of branch history , 1993, ISCA '93.

[48]  B. Ramakrishna Rau,et al.  Efficient code generation for horizontal architectures: Compiler techniques and architectural support , 1982, ISCA '82.

[49]  David A. Padua,et al.  Gated SSA-based demand-driven symbolic analysis for parallelizing compilers , 1995, ICS '95.

[50]  Yale N. Patt,et al.  A comprehensive instruction fetch mechanism for a processor supporting speculative execution , 1992, MICRO 25.

[51]  Prithviraj Banerjee,et al.  Advanced compilation techniques in the PARADIGM compiler for distributed-memory multicomputers , 1995, ICS '95.

[52]  Vinod Kathail,et al.  Height reduction of control recurrences for ILP processors , 1994, Proceedings of MICRO-27. The 27th Annual IEEE/ACM International Symposium on Microarchitecture.