Optimal weighted loop fusion for parallel programs

Nimrod Megiddo
IBM Almaden Research Center and Tel Aviv University
Email: megiddo@almaden.ibm.com

Vivek Sarkar
MIT Laboratory for Computer Science and IBM Software Solutions Division
Email: vivek@lcs.mit.edu

Much of the computation involved in parallel programs occurs within loops, either nested loops as in parallel scientific applications or collections of loops as in stream-based applications. Loop fusion is a well-known program transformation that has been shown to be effective in improving data locality in parallel programs by reducing inter-processor communication and improving register and cache locality. Weighted loop fusion is the problem of finding a legal partition of loop nests into fusible clusters so as to minimize the total inter-cluster weights. The loop nests may contain parallel or sequential loops; care is taken to ensure that a parallel loop does not get serialized after fusion. It has been shown in past work that the weighted loop fusion problem is NP-hard. Despite the NP-hardness property, we show how optimal solutions can be found efficiently (i.e., within the compile-time constraints of a product-quality optimizing compiler) for weighted loop fusion problem sizes that occur in practice.

In this paper, we present an integer programming formulation for weighted loop fusion whose size (number of variables and constraints) is linearly proportional to the size of the input weighted loop fusion problem. The linearized formulation is key to making the execution time small enough for use in a product-quality optimizing compiler, since the natural integer programming formulation for this problem has cubic size, for which the execution time would be too large to be practical. The linear-sized integer programming formulation can be solved efficiently using any standard optimization package, but we also present a custom branch-and-bound algorithm that can be used if greater efficiency is desired.
A prototype implementation of this approach has been completed, and preliminary compile-time measurements are included in the paper to validate its practicality.

SPAA 97, Newport, Rhode Island, USA. Copyright 1997 ACM 0-89791-890-8/97/06 $3.50
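The problem statement in the abstract (partition loop nests into legal clusters while minimizing total inter-cluster weight) can be illustrated with a minimal brute-force sketch. This is not the paper's integer programming formulation or its branch-and-bound algorithm; it only enumerates all partitions of a tiny graph. The model is an assumption made for illustration: each edge is a hypothetical tuple `(src, dst, weight, contractable)`, a partition is taken to be legal when no non-contractable edge lies inside a cluster and the graph of clusters is acyclic, and the names `best_fusion`, `cluster_graph_is_acyclic`, and `EDGES` are invented here. A constraint such as "fusing would serialize a parallel loop" is assumed to be encoded by marking the edge non-contractable.

```python
from itertools import product


def cluster_graph_is_acyclic(n, cluster_edges):
    """Kahn's algorithm over cluster labels 0..n-1 (unused labels are isolated)."""
    indegree = [0] * n
    successors = [[] for _ in range(n)]
    for a, b in cluster_edges:
        successors[a].append(b)
        indegree[b] += 1
    ready = [c for c in range(n) if indegree[c] == 0]
    visited = 0
    while ready:
        c = ready.pop()
        visited += 1
        for d in successors[c]:
            indegree[d] -= 1
            if indegree[d] == 0:
                ready.append(d)
    return visited == n


def best_fusion(n, edges):
    """Exhaustively search cluster labelings of n loop nests.

    Returns (minimum inter-cluster weight, clusters as sorted lists of nodes).
    Exponential in n, so only suitable for toy inputs.
    """
    best = None
    for labels in product(range(n), repeat=n):
        cost = 0
        cluster_edges = set()
        legal = True
        for src, dst, weight, contractable in edges:
            if labels[src] == labels[dst]:
                if not contractable:  # fusion-preventing edge inside a cluster
                    legal = False
                    break
            else:
                cost += weight  # edge crosses clusters, so its weight is paid
                cluster_edges.add((labels[src], labels[dst]))
        if not legal or not cluster_graph_is_acyclic(n, cluster_edges):
            continue
        if best is None or cost < best[0]:
            clusters = {}
            for node, label in enumerate(labels):
                clusters.setdefault(label, []).append(node)
            best = (cost, sorted(clusters.values()))
    return best


# Hypothetical 4-loop example: (src, dst, weight, contractable)
EDGES = [
    (0, 1, 2, True),
    (1, 2, 3, False),  # loops 1 and 2 must not be fused
    (0, 2, 6, True),
    (2, 3, 2, True),
]

print(best_fusion(4, EDGES))  # (9, [[0, 1], [2, 3]])
```

In this example, fusing loops 0, 2, and 3 while leaving loop 1 alone would cost only 5, but it creates a cycle between the two clusters (0 precedes 1, and 1 precedes 2), so the search rejects it. Exhaustive enumeration grows exponentially with the number of loop nests, which is the motivation for a compact integer programming formulation of the kind the abstract describes.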
