Automatic data and computation decomposition on distributed memory parallel computers

To exploit parallelism on shared memory parallel computers (SMPCs), it is natural to focus on decomposing the computation, mainly by distributing the iterations of nested Do-loops. In contrast, on distributed memory parallel computers (DMPCs), the decomposition of computation and the distribution of data must both be handled in order to balance the computation load and to minimize the migration of data. We propose and experimentally validate a method for handling computations and data synergistically to minimize the overall execution time on DMPCs. The method is based on a number of novel techniques, also presented in this article. The core idea is to rank the "importance" of the data arrays in a program and to designate some of them as dominant. The intuition is that the dominant arrays are the ones whose migration would be the most expensive. The correspondence between iteration space mapping vectors and the distributed dimensions of the dominant data array in each nested Do-loop allows us to design algorithms that determine data and computation decompositions at the same time. Given the data distribution, the computation decomposition for each nested Do-loop is determined by either the "owner computes" rule or the "owner stores" rule with respect to the dominant data array. If all temporal dependence relations across iteration partitions are regular, we use tiling to allow pipelining and the overlapping of computation and communication. However, in order to use tiling on DMPCs, we needed to extend the existing techniques for determining tiling vectors and tile sizes, as they were originally suited to SMPCs only. The overall method is illustrated on programs for the 2D heat equation, for Gaussian elimination with pivoting, and for the 2D fast Fourier transform, on a linear processor array and on a 2D processor grid.
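As a rough illustration of the "owner computes" rule mentioned above, the following Python sketch assigns each iteration of a one-dimensional stencil loop to the processor that owns the element being written. It is not the method of the article: the loop body `A[i] = B[i-1] + B[i+1]`, the choice of `A` as the dominant array, the block distribution, and the helper names `block_owner` and `owner_computes` are all assumptions made for this example.

```python
def block_owner(index, n, p):
    """Return the processor that owns element `index` of an n-element array
    block-distributed over p processors (block size = ceil(n / p))."""
    block = -(-n // p)  # ceiling division
    return index // block


def owner_computes(n, p):
    """Partition the iterations of the hypothetical loop
        for i in 1 .. n-2:  A[i] = B[i-1] + B[i+1]
    by the owner-computes rule: iteration i runs on the owner of A[i].
    A and B are assumed to share the same block distribution.
    Also record which B elements each processor must fetch remotely."""
    iterations = {q: [] for q in range(p)}
    remote_reads = {q: set() for q in range(p)}
    for i in range(1, n - 1):
        q = block_owner(i, n, p)          # owner of the written element A[i]
        iterations[q].append(i)
        for j in (i - 1, i + 1):          # elements of B read by iteration i
            if block_owner(j, n, p) != q:
                remote_reads[q].add(j)    # data that would have to migrate
    return iterations, remote_reads


iterations, remote_reads = owner_computes(n=8, p=2)
print(iterations)    # iterations executed by each processor
print(remote_reads)  # B elements each processor reads from a neighbour
```

With 8 elements on 2 processors, each processor executes the iterations that write its own block, and the only data that crosses the block boundary is one element of `B` in each direction. Choosing a distribution under which such boundary sets stay small is the kind of trade-off between load balance and data migration that the abstract describes.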
