A Matrix-Based Approach to Global Locality Optimization

Global locality optimization is a technique for improving the cache performance of a sequence of loop nests through a combination of loop and data layout transformations. Pure loop transformations are restricted by data dependencies and may not be very successful in optimizing imperfectly nested loops and explicitly parallelized programs. Pure data transformations are not constrained by data dependencies, but the impact of a data transformation on an array may be program-wide: it can affect every reference to that array in every loop nest. Therefore, in this paper we argue for an integrated approach that employs both loop and data transformations. The method combines the advantages of most previous locality-enhancing techniques and is efficient. In our approach, the loop nests in a program are processed one by one, and the data layout constraints obtained from one nest are propagated to the optimization of the remaining loop nests. We present a simple and effective matrix-based framework that implements this process. The loop transformations we consider are those that can be represented by general nonsingular linear transformation matrices, and the data layouts we consider are those that can be expressed using hyperplanes. Experiments with several floating-point programs on an 8-processor SGI Origin 2000 distributed-shared-memory machine demonstrate the efficacy of our approach.
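The nest-by-nest propagation described above can be illustrated with a small sketch. It is not the paper's algorithm. It assumes one common hyperplane formulation: a 2-D layout is a vector g (g = [1, 0] for row-major, g = [0, 1] for column-major), a reference is an access matrix L, and q is the last column of the inverse of the loop transformation matrix. Under that assumption a reference has spatial locality in the innermost loop when g . (L q) = 0. The two example nests, the array names U, V, W, and all function names are hypothetical.

```python
# Illustrative sketch only; every name here is an assumption, not from the paper.

def mat_vec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def det2(m):
    """Determinant of a 2x2 matrix; nonzero means nonsingular."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def innermost_direction(q_inv):
    """Last column of the inverse loop transformation matrix."""
    return [row[-1] for row in q_inv]

def has_spatial_locality(g, access, q):
    """Assumed condition: successive innermost iterations stay on one hyperplane."""
    return dot(g, mat_vec(access, q)) == 0

def layout_for(access, q):
    """Pick a 2-D layout hyperplane vector orthogonal to access * q."""
    d = mat_vec(access, q)
    g = [d[1], -d[0]]
    first_nonzero = next((x for x in g if x != 0), 0)
    # Normalize the sign so the first nonzero entry is positive.
    return [-x for x in g] if first_nonzero < 0 else g

IDENTITY = [[1, 0], [0, 1]]
INTERCHANGE = [[0, 1], [1, 0]]

# Nest 1: for i, for j: U[i][j] = V[j][i]. Keep the loop order, choose layouts.
q1 = innermost_direction(IDENTITY)
layouts = {
    "U": layout_for([[1, 0], [0, 1]], q1),  # access U[i][j]
    "V": layout_for([[0, 1], [1, 0]], q1),  # access V[j][i]
}

# Nest 2: for i, for j: W[i][j] = V[i][j]. V's layout is now a fixed constraint,
# so search the candidate loop transformations for one that satisfies it.
v_access = [[1, 0], [0, 1]]
chosen = None
for q_inv in (IDENTITY, INTERCHANGE):
    q = innermost_direction(q_inv)
    if det2(q_inv) != 0 and has_spatial_locality(layouts["V"], v_access, q):
        chosen = q_inv
        break

# W is not yet constrained, so its layout follows the chosen loop order.
# (If no candidate had qualified, chosen would be None and this step would be skipped.)
q2 = innermost_direction(chosen)
layouts["W"] = layout_for([[1, 0], [0, 1]], q2)
```

In this toy case the first nest fixes U as row-major and V as column-major. The second nest inherits V's column-major constraint, which the original loop order violates, so loop interchange is selected and W is then laid out column-major to match.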
