Generating efficient tiled code for distributed memory machines

Abstract

Tiling can improve the performance of nested loops on distributed memory machines by exploiting coarse-grain parallelism and by reducing both the overhead and the frequency of communication. Tiling calls for a compilation approach that performs computation distribution first and data distribution second, both possibly on a skewed iteration space. This paper presents a suite of compiler techniques for generating efficient SPMD programs that execute rectangularly tiled iteration spaces on distributed memory machines. The following issues are addressed: computation and data distribution, message-passing code generation, memory management and optimisations, and global-to-local address translation. Methods are developed for partitioning arbitrary iteration spaces and skewed data spaces. Techniques are presented for generating efficient message-passing code for both arbitrary and rectangular iteration spaces. A storage scheme for managing both local and nonlocal references is developed, which leads to SPMD code with high locality of reference. Two memory optimisations are given to reduce memory usage for skewed iteration spaces and for expanded arrays, respectively. The proposed compiler techniques are illustrated using a simple running example, then analysed and evaluated based on experimental results on a 128-processor Fujitsu AP1000.
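To make the terms in the abstract concrete, the following is a minimal sketch of rectangular tiling, computation distribution, and global-to-local address translation for a two-dimensional iteration space. It is not the paper's algorithm: the function names (`tiles`, `owner`, `global_to_local`) are hypothetical, and the cyclic assignment of tile columns to processors is an assumption chosen only for illustration.

```python
def tiles(n_rows, n_cols, tile_rows, tile_cols):
    """Yield (tile_i, tile_j, row_range, col_range) for a rectangular tiling.

    Boundary tiles are clipped so that the tiles cover the iteration
    space exactly once.
    """
    for ti, r0 in enumerate(range(0, n_rows, tile_rows)):
        for tj, c0 in enumerate(range(0, n_cols, tile_cols)):
            rows = range(r0, min(r0 + tile_rows, n_rows))
            cols = range(c0, min(c0 + tile_cols, n_cols))
            yield ti, tj, rows, cols


def owner(tile_j, num_procs):
    """Assumed computation distribution: tile columns are dealt out cyclically."""
    return tile_j % num_procs


def global_to_local(j, tile_cols, num_procs):
    """Map a global column index to (processor, local column index).

    Assumes each processor stores the tile columns it owns contiguously,
    in increasing order of tile index.
    """
    tile_j, offset = divmod(j, tile_cols)
    local_tile = tile_j // num_procs  # position among the owner's tile columns
    return owner(tile_j, num_procs), local_tile * tile_cols + offset


if __name__ == "__main__":
    # A 6 x 10 iteration space cut into 4 x 4 tiles on 2 processors.
    for ti, tj, rows, cols in tiles(6, 10, 4, 4):
        print(f"tile ({ti},{tj}) on processor {owner(tj, 2)}: "
              f"rows {rows.start}..{rows.stop - 1}, cols {cols.start}..{cols.stop - 1}")
```

In this sketch each processor executes only the tiles it owns, which is the SPMD pattern the abstract refers to; a real compiler would additionally generate the messages needed for nonlocal references between neighbouring tiles.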
