A Unified Framework for Optimizing Locality, Parallelism, and Communication in Out-of-Core Computations

This paper presents a unified framework that optimizes out-of-core programs by exploiting locality and parallelism and by reducing communication overhead. For out-of-core problems, where the data sets far exceed the available in-core memory, it is particularly important to exploit the memory hierarchy by optimizing I/O accesses. We present algorithms that consider both iteration space (loop) and data space (file layout) transformations in a unified framework. We show that the performance of a loop nest containing references to out-of-core arrays can be improved by a suitable combination of file layout choices and loop restructuring transformations. Our approach considers array references one by one and attempts to optimize each reference for parallelism and locality. For references where the parallelism optimizations do not work, communication is vectorized so that data transfer can be performed before the innermost loop. Results from hand-compiled codes on the IBM SP-2 and Intel Paragon distributed-memory message-passing architectures show that this approach reduces execution times and improves overall speedups. In addition, we extend the base algorithm to work with file layout constraints and show how it is useful for optimizing programs that consist of multiple loop nests.
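To illustrate why file layout and loop structure have to be chosen together, the following minimal sketch counts how many contiguous file reads are needed to bring one rectangular tile of a two-dimensional out-of-core array into memory. It is not the paper's algorithm; the function names, the layout strings, and the cost model of one I/O call per contiguous run of file offsets are assumptions made for illustration only.

```python
def file_offset(i, j, n_rows, n_cols, layout):
    """Element offset of A[i][j] in a file under the given layout (hypothetical helper)."""
    if layout == "row-major":
        return i * n_cols + j
    if layout == "column-major":
        return j * n_rows + i
    raise ValueError(f"unknown layout: {layout}")


def count_io_calls(row0, col0, tile_rows, tile_cols, n_rows, n_cols, layout):
    """Count contiguous runs of file offsets covered by a tile.

    Assumed cost model: each maximal contiguous run costs one I/O call.
    """
    offsets = sorted(
        file_offset(i, j, n_rows, n_cols, layout)
        for i in range(row0, row0 + tile_rows)
        for j in range(col0, col0 + tile_cols)
    )
    if not offsets:
        return 0
    calls = 1
    for prev, cur in zip(offsets, offsets[1:]):
        if cur != prev + 1:  # a gap in the file starts a new I/O call
            calls += 1
    return calls


if __name__ == "__main__":
    n = 8  # an 8 x 8 array, small enough to check by hand
    for shape in [(2, 8), (8, 2)]:
        for layout in ("row-major", "column-major"):
            calls = count_io_calls(0, 0, shape[0], shape[1], n, n, layout)
            print(f"tile {shape[0]}x{shape[1]}, {layout}: {calls} I/O call(s)")
```

Under this model, a tile made of full rows is read in one call from a row-major file but needs one call per column from a column-major file, and the reverse holds for a tile made of full columns. A loop transformation that changes the shape of the tile a nest touches therefore changes which file layout is cheap, which is the interaction the framework exploits.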
