Removing Impediments to Loop Fusion Through Code Transformations

Loop fusion is a common optimization technique that takes several loops and combines them into a single large loop. Most of the existing work on loop fusion concentrates on the heuristics required to optimize an objective function, such as data reuse or the creation of instruction-level parallelism opportunities. Often, however, the code provided to a compiler has only small sets of loops that are control flow equivalent, normalized, have the same iteration count, are adjacent, and have no fusion-preventing dependences. This paper focuses on code transformations that create more opportunities for loop fusion in the IBM® XL compiler suite, which generates code for the IBM family of PowerPC® processors. In this compiler, an objective function is used at the loop distributor to decide which portions of a loop should remain in the same loop nest and which portions should be redistributed. Our algorithm focuses on eliminating conditions that prevent loop fusion. By generating maximal fusion, our algorithm increases the scope of later transformations. We tested our improved code generator on an IBM pSeries 690 machine equipped with a POWER4 processor using the SPEC CPU2000 benchmark suite. Our improvements to loop fusion resulted in three times as many loops fused in a subset of CFP2000 benchmarks, and four times as many for a subset of CINT2000 benchmarks.
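
To make one of the listed conditions concrete, the sketch below shows two adjacent, independent loops whose iteration counts differ, and a fused form that runs the shared iterations in one loop and handles the leftover iterations separately. This is only an illustration of the general idea of removing an impediment; the abstract does not say which transformations the XL compiler applies, and the function names `separate_loops` and `fused_loop` are hypothetical.

```python
def separate_loops(a, b):
    """Two adjacent loops with different iteration counts (not directly fusible)."""
    n, m = len(a), len(b)
    out_a = [0] * n
    out_b = [0] * m
    for i in range(n):          # first loop: n iterations
        out_a[i] = a[i] * 2
    for i in range(m):          # second loop: m iterations
        out_b[i] = b[i] + 1
    return out_a, out_b


def fused_loop(a, b):
    """Same computation, with the common iterations fused into one loop.

    Assumption: the two loop bodies do not depend on each other, so
    interleaving their iterations preserves the result.
    """
    n, m = len(a), len(b)
    out_a = [0] * n
    out_b = [0] * m
    common = min(n, m)
    for i in range(common):     # fused loop over the shared iteration range
        out_a[i] = a[i] * 2
        out_b[i] = b[i] + 1
    for i in range(common, n):  # leftover iterations of the first loop, if any
        out_a[i] = a[i] * 2
    for i in range(common, m):  # leftover iterations of the second loop, if any
        out_b[i] = b[i] + 1
    return out_a, out_b


if __name__ == "__main__":
    a = [1, 2, 3]
    b = [10, 20, 30, 40, 50]
    print(separate_loops(a, b) == fused_loop(a, b))
```

At most one of the two leftover loops executes for a given input, so the fused version visits each element exactly once while sharing loop overhead across the common range.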
