Parallelizing nonnumerical code with selective scheduling and software pipelining

Instruction-level parallelism (ILP) in nonnumerical code is regarded as scarce and hard to exploit due to its irregularity. In this article, we introduce a new code-scheduling technique for irregular ILP called “selective scheduling,” which can be used as a component of superscalar and VLIW compilers. Selective scheduling can compute a wide set of independent operations across all execution paths based on renaming and forward substitution, and it can compute available operations across loop iterations if combined with software pipelining. This scheduling approach has better heuristics for determining the usefulness of moving one operation versus moving another and can successfully find useful code motions without resorting to branch profiling. The compile-time overhead of selective scheduling is low due to its incremental computation technique and its controlled code duplication. We parallelized the SPEC integer benchmarks and five AIX utilities without using branch probabilities. The experiments indicate that a fivefold speedup is achievable on realistic resources with a reasonable overhead in compilation time and code expansion, and that a solid speedup increase is also obtainable on machines with fewer resources. These results improve on previously known characteristics of irregular ILP.
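
As a rough illustration of the two transformations the abstract names, the following toy sketch shows how renaming and forward substitution can let one operation move above another that it appears to depend on. It is not the article's algorithm: the `Op` record, the `move_above` function, the `"copy"` opcode convention, and the caller-supplied temporary name are all assumptions made for this example, and the sketch handles only a single pair of register-to-register operations on one path.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Op:
    """Hypothetical three-address operation: dest = opcode(srcs)."""
    dest: str
    opcode: str              # by convention here, "copy" means dest = srcs[0]
    srcs: Tuple[str, ...]


def move_above(op: Op, blocker: Op, temp: str = "t0") -> Optional[Tuple[Op, Optional[Op]]]:
    """Try to hoist `op` above `blocker`.

    Returns (hoisted_op, leftover_copy), where leftover_copy stays at the
    original position of `op`, or None if a true dependence blocks the move.
    `temp` is assumed to be a register name unused elsewhere.
    """
    srcs = op.srcs
    if blocker.dest in srcs:
        if blocker.opcode != "copy":
            return None  # true (flow) dependence on a computed value
        # Forward substitution: read the copy's source instead of its target.
        srcs = tuple(blocker.srcs[0] if s == blocker.dest else s for s in srcs)

    dest, leftover = op.dest, None
    if dest == blocker.dest or dest in blocker.srcs:
        # Renaming: write a temporary above the blocker, copy it back below.
        dest = temp
        leftover = Op(op.dest, "copy", (temp,))
    return Op(dest, op.opcode, srcs), leftover


if __name__ == "__main__":
    # x = y ; z = add x, w  ->  z = add y, w can be placed above the copy
    print(move_above(Op("z", "add", ("x", "w")), Op("x", "copy", ("y",))))
    # r1 = add r2, r3 ; r2 = mul r4, r5  ->  the mul is renamed to a temporary
    print(move_above(Op("r2", "mul", ("r4", "r5")), Op("r1", "add", ("r2", "r3"))))
```

In this sketch, only a flow dependence on a non-copy operation stops the motion; anti and output dependences are removed by renaming at the cost of one leftover copy.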
