A unified compiler algorithm for optimizing locality, parallelism and communication in out-of-core computations

This paper presents compiler algorithms to optimize out-of-core programs. These algorithms consider loop and data layout transformations in a unified framework. The performance of an out-of-core loop nest containing many references can be improved by a combination of restructuring the loops and the file layouts. The approach considers array references one by one and attempts to optimize each reference for parallelism and locality. For references on which the parallelism optimizations do not work, communication is vectorized so that data transfer can be performed before the innermost tiling loop. Preliminary results from hand-compiled codes on the IBM SP-2 and Intel Paragon show that this approach reduces execution time and improves both the bandwidth speedup and the overall speedup. In addition, we extend the base algorithm to work with file layout constraints and show how it can be used to optimize programs consisting of multiple loop nests.
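To illustrate why loop (tile) shape and file layout need to be chosen together, the following is a minimal sketch of a simplified I/O cost model. It is not the algorithm of the paper: the function names, the square `n x n` array, and the assumption that each contiguous run of file elements costs exactly one I/O request are all hypothetical choices made for the example.

```python
def io_calls_per_tile(n, tile_rows, tile_cols, layout):
    """Contiguous file requests needed to read one tile of an n x n array.

    Assumed model (hypothetical): one request per contiguous run in the file.
    layout is "row" (row-major file) or "col" (column-major file).
    """
    if layout == "row":
        # Each row segment of the tile is contiguous; a full-width tile
        # spans whole consecutive rows and is therefore a single run.
        return 1 if tile_cols == n else tile_rows
    if layout == "col":
        # Symmetric case for a column-major file.
        return 1 if tile_rows == n else tile_cols
    raise ValueError("layout must be 'row' or 'col'")


def total_io_calls(n, tile_rows, tile_cols, layout):
    """Requests needed to read the whole array once, tile by tile."""
    # Assumption: tile dimensions divide n evenly.
    assert n % tile_rows == 0 and n % tile_cols == 0
    tiles = (n // tile_rows) * (n // tile_cols)
    return tiles * io_calls_per_tile(n, tile_rows, tile_cols, layout)


if __name__ == "__main__":
    n = 1024  # every tile below holds 64 * 1024 elements of memory
    print(total_io_calls(n, 256, 256, "row"))   # square tiles: 4096
    print(total_io_calls(n, 64, 1024, "row"))   # row slabs, matching layout: 16
    print(total_io_calls(n, 64, 1024, "col"))   # row slabs, mismatched layout: 16384
```

Under this model, the same memory budget yields request counts that differ by orders of magnitude depending on whether the tile shape agrees with the file layout, which is the kind of interaction a combined loop and file layout transformation is meant to exploit.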