Path-Selection Heuristics for Dominator-Path Scheduling

Virtually all programs exhibit some form of parallelism. Instruction-level parallel (ILP) architectures can exploit the fine-grained parallelism inherent in most programs by executing multiple operations from the same instruction stream simultaneously. Such architectures require special compilation and optimization techniques; one such optimization, instruction scheduling, reorders the instructions within a function to overlap their latencies and thereby speed up the function as a whole. Dominator-Path Scheduling (DPS), a popular global scheduling method, moves code along a selected path in the dominator tree. There are many ways to select these paths; this thesis investigates what characterizes a good choice of dominator paths for DPS. No single heuristic performed well enough to use on its own, but a combination of heuristics does very well in almost all cases. I recommend a combination of three heuristics: limiting paths to one nesting level of a single loop, grouping sparse blocks together, and including non-postdominating blocks. This combined heuristic achieved the best possible schedules for the benchmarks bubble and matrix_mult and came within 4% of the best for gauss. On bubble, it produced schedules 24% better than those from DPS's previous method of choosing dominator paths. The combined heuristic is likely to come as close as possible to the best schedule for most functions.
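
To make the combined heuristic concrete, here is a minimal sketch of path selection over a dominator tree under the three criteria above. It is illustrative only: Block, its fields, SPARSE_THRESHOLD, and select_paths are hypothetical names, not the thesis's implementation, and the sparseness cutoff and tie-breaking order are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Block:
    """One basic block, annotated with the facts the heuristics consult."""
    name: str
    loop_id: int                  # enclosing loop (0 for no loop)
    nesting_level: int            # loop-nesting depth of the block
    num_instructions: int         # block size, used to judge sparseness
    postdominates_parent: bool    # does this block postdominate its dominator?
    children: list = field(default_factory=list)  # dominator-tree children

SPARSE_THRESHOLD = 4           # assumed cutoff below which a block is "sparse"
REQUIRE_POSTDOMINANCE = False  # heuristic 3: keep non-postdominating blocks

def may_extend(tail: Block, child: Block) -> bool:
    """May a dominator path grow from tail into child?"""
    # Heuristic 1: stay within one nesting level of a single loop.
    if child.loop_id != tail.loop_id or child.nesting_level != tail.nesting_level:
        return False
    # Heuristic 3: non-postdominating blocks remain eligible.
    if REQUIRE_POSTDOMINANCE and not child.postdominates_parent:
        return False
    return True

def pick_child(block: Block) -> Block | None:
    """Choose at most one dominator-tree child to continue the path."""
    candidates = [c for c in block.children if may_extend(block, c)]
    if not candidates:
        return None
    # Heuristic 2: group sparse blocks so their instructions can be merged.
    if block.num_instructions < SPARSE_THRESHOLD:
        sparse = [c for c in candidates if c.num_instructions < SPARSE_THRESHOLD]
        if sparse:
            return sparse[0]
    return candidates[0]

def select_paths(root: Block) -> list[list[Block]]:
    """Partition the dominator tree into paths for DPS (names assumed unique)."""
    paths, work, on_path = [], [root], set()
    while work:
        start = work.pop()
        if start.name in on_path:
            continue
        path, cur = [start], start
        on_path.add(start.name)
        while (nxt := pick_child(cur)) is not None and nxt.name not in on_path:
            path.append(nxt)
            on_path.add(nxt.name)
            cur = nxt
        for b in path:  # children not absorbed into this path seed new paths
            work.extend(c for c in b.children if c.name not in on_path)
        paths.append(path)
    return paths
```

The sketch assigns each block to exactly one path; whether DPS demands a strict partition of the dominator tree, rather than merely disjoint paths, is likewise an assumption here.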
