Parallelization of Control Recurrences for ILP Processors

The performance of applications on processors with instruction-level parallelism is often limited by control and data dependences. Bottlenecks caused by these dependences can frequently be eliminated through transformations that reduce the height of critical paths through the program, and such techniques apply to an increasingly broad range of important situations. This paper focuses on reducing the height of control recurrences within loops with data-dependent exits. Such loops are transformed to alleviate performance bottlenecks resulting from control dependences, and a compilation approach for effecting these transformations is described. Combined with prior work on reducing the height of data dependences, the techniques presented here provide a comprehensive approach to accelerating loops with conditional exits. In many cases, these loops expose a degree of parallelism traditionally associated with vectorization: on a processor with adequate instruction-level parallelism, multiple iterations of a loop can be retired in a single cycle at no cost in code redundancy. In more difficult cases, height reduction requires redundant computation or may not be feasible.
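To make the idea concrete, consider a minimal sketch (not from the paper; the loop, names, and unroll factor of 4 are illustrative assumptions). In a search loop, each iteration's exit test depends on the previous one, forming a control recurrence whose height is one branch per iteration. Unrolling and speculatively evaluating several exit conditions at once, then combining them with a single OR-reduced exit branch, reduces that height to one branch per group of iterations:

```c
#include <stddef.h>

/* Baseline: a loop whose only recurrence is the control recurrence
   formed by the data-dependent exit test. */
static size_t find_first(const int *a, size_t n, int key) {
    size_t i = 0;
    while (i < n && a[i] != key)   /* each exit test gates the next iteration */
        i++;
    return i;                      /* returns n if key is not found */
}

/* Height-reduced sketch: unroll by 4 and speculatively evaluate all four
   exit conditions; the four loads/compares are mutually independent and
   can issue in parallel, leaving one combined exit branch per group. */
static size_t find_first_unrolled(const int *a, size_t n, int key) {
    size_t i = 0;
    while (i + 4 <= n) {
        int hit = (a[i]     == key) | (a[i + 1] == key) |
                  (a[i + 2] == key) | (a[i + 3] == key);
        if (hit) break;            /* one exit branch covers 4 iterations */
        i += 4;
    }
    while (i < n && a[i] != key)   /* epilogue resolves the exact exit point */
        i++;
    return i;
}
```

The transformed loop is safe here because the speculative loads stay within the bounds already guaranteed by the loop test; in general, as the abstract notes, height reduction of the exit recurrence may require redundant computation (the epilogue re-tests up to four elements) or may not be feasible at all.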
