Partial resolution in branch target buffers

Compile-time reordering of low-level instructions has achieved large performance gains for programs on fine-grain parallel machines. However, because instruction scheduling and register allocation are interdependent, a lack of cooperation between the scheduler and the register allocator can produce code that contains excess register spills, a lower degree of parallelism than is actually achievable, or both. This paper describes a strategy for providing cooperation between register allocation and both global and local instruction scheduling. We experimentally compare this strategy with other cooperative and uncooperative scenarios. Our experiments indicate that the greatest speedups are obtained by performing either cooperative or uncooperative global instruction scheduling with cooperative register allocation and local instruction scheduling.
