Path-Selection Heuristics for Dominator-Path Scheduling

Virtually all programs exhibit some form of parallelism. Instruction-level parallel (ILP) architectures can exploit the fine-grained parallelism inherent in most programs by executing multiple operations from the same instruction stream simultaneously. Such architectures require special compilation and optimization techniques; one such optimization, instruction scheduling, reorders the instructions within a function to overlap their latencies and thereby speed up the function as a whole. Dominator-Path Scheduling (DPS), a popular global scheduling method, moves code along a selected path in the dominator tree. There are many ways to select these paths; this thesis investigates what characterizes a good choice of dominator paths for DPS. No single heuristic performed well enough to use on its own, but a combination of heuristics does very well in almost all cases. I recommend a combination of three heuristics: limiting paths to one nesting level of a single loop, grouping sparse blocks together, and including non-postdominating blocks. This combined heuristic achieved the best possible schedules for the benchmarks bubble and matrix_mult and came within 4% of the best for gauss. On bubble, it produced schedules 24% better than those from DPS's previous method of choosing dominator paths. The combined heuristic is likely to come as close as possible to the best schedule for most functions.
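
To make the combined heuristic concrete, here is a minimal sketch of path selection over a dominator tree under the three criteria above. It is illustrative only: Block, its fields, SPARSE_THRESHOLD, and select_paths are hypothetical names, not the thesis's implementation, and the sparseness cutoff and tie-breaking order are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Block:
    """One basic block, annotated with the facts the heuristics consult."""
    name: str
    loop_id: int                  # enclosing loop (0 for no loop)
    nesting_level: int            # loop-nesting depth of the block
    num_instructions: int         # block size, used to judge sparseness
    postdominates_parent: bool    # does this block postdominate its dominator?
    children: list = field(default_factory=list)  # dominator-tree children

SPARSE_THRESHOLD = 4           # assumed cutoff below which a block is "sparse"
REQUIRE_POSTDOMINANCE = False  # heuristic 3: keep non-postdominating blocks

def may_extend(tail: Block, child: Block) -> bool:
    """May a dominator path grow from tail into child?"""
    # Heuristic 1: stay within one nesting level of a single loop.
    if child.loop_id != tail.loop_id or child.nesting_level != tail.nesting_level:
        return False
    # Heuristic 3: non-postdominating blocks remain eligible.
    if REQUIRE_POSTDOMINANCE and not child.postdominates_parent:
        return False
    return True

def pick_child(block: Block) -> Block | None:
    """Choose at most one dominator-tree child to continue the path."""
    candidates = [c for c in block.children if may_extend(block, c)]
    if not candidates:
        return None
    # Heuristic 2: group sparse blocks so their instructions can be merged.
    if block.num_instructions < SPARSE_THRESHOLD:
        sparse = [c for c in candidates if c.num_instructions < SPARSE_THRESHOLD]
        if sparse:
            return sparse[0]
    return candidates[0]

def select_paths(root: Block) -> list[list[Block]]:
    """Partition the dominator tree into paths for DPS (names assumed unique)."""
    paths, work, on_path = [], [root], set()
    while work:
        start = work.pop()
        if start.name in on_path:
            continue
        path, cur = [start], start
        on_path.add(start.name)
        while (nxt := pick_child(cur)) is not None and nxt.name not in on_path:
            path.append(nxt)
            on_path.add(nxt.name)
            cur = nxt
        for b in path:  # children not absorbed into this path seed new paths
            work.extend(c for c in b.children if c.name not in on_path)
        paths.append(path)
    return paths
```

The sketch assigns each block to exactly one path; whether DPS demands a strict partition of the dominator tree, rather than merely disjoint paths, is likewise an assumption here.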
