Automatic Parallelization of Loop Programs for Distributed Memory Architectures

Parallel computers, especially in the form of clusters of standard PCs, have become reasonably cheap within the last few years. It is natural to want to use the increased computing power of such hardware to speed up a given application. To do so, however, application programs must be transformed so that they benefit from the parallel hardware. One way to generate the necessary parallel software is automatic parallelization, i.e., a parallelizing compiler. Such a tool takes a program in which nothing is specified about parallelism and automatically transforms it into a parallel program. This approach introduces parallelism with little effort and, at the same time, with guaranteed correctness with respect to the input program. This thesis presents a way to build such a parallelizing compiler. In order to be efficient, we restrict ourselves to arbitrarily nested loops as the only control structure causing repeated computations. We apply a mathematical model, the polyhedron model, which gives a unified framework for the various parallelization tasks and permits a directed search for optimal solutions. We touch on nearly every parallelization task and demonstrate the interactions between them. One of the main topics of this thesis is how to extract parallelism of the right granularity: parallelism that is too coarse-grained might not exploit the parallelism available in the hardware, while parallelism that is too fine-grained leads to increased overhead, especially communication overhead. We derive a method that makes it possible to adapt the granularity precisely to the given parallel architecture.
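
The granularity trade-off can be illustrated with a small sketch. It is not the method of the thesis. It assumes a hypothetical two-dimensional loop nest of size n x m with uniform dependences (1,0) and (0,1), rectangular tiles, and one message per tile boundary; the function names and the cost model are illustrative assumptions only.

```python
def tile_iteration_space(n, m, tile_h, tile_w):
    """Partition the n x m iteration space into rectangular tiles.

    Each tile is returned as (i_lo, i_hi, j_lo, j_hi) with exclusive upper bounds.
    """
    tiles = []
    for i0 in range(0, n, tile_h):
        for j0 in range(0, m, tile_w):
            tiles.append((i0, min(i0 + tile_h, n), j0, min(j0 + tile_w, m)))
    return tiles


def granularity_stats(n, m, tile_h, tile_w):
    """Count tiles, messages, and communicated elements for one tile size.

    Assumption: with dependences (1,0) and (0,1), every tile sends its last
    row to the tile below and its last column to the tile on its right.
    """
    messages = 0
    volume = 0
    tiles = tile_iteration_space(n, m, tile_h, tile_w)
    for i0, i1, j0, j1 in tiles:
        if i1 < n:  # a neighbouring tile exists below
            messages += 1
            volume += j1 - j0
        if j1 < m:  # a neighbouring tile exists to the right
            messages += 1
            volume += i1 - i0
    return {"tiles": len(tiles), "messages": messages, "volume": volume}


if __name__ == "__main__":
    # Compare fine, medium, and coarse granularity on an 8 x 8 space.
    for size in (1, 4, 8):
        print(size, granularity_stats(8, 8, size, size))
```

Under these assumptions, 1 x 1 tiles expose the most parallel units but also need the most messages, whereas a single 8 x 8 tile needs no communication and offers no parallelism at all. A suitable tile size lies between these extremes and depends on the target architecture.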
