Static and Dynamic Locality Optimizations Using Integer Linear Programming

The performance delivered by modern processors that employ deep memory hierarchies is closely tied to the performance of the memory subsystem. Compiler optimizations aimed at improving cache locality are critical in realizing the performance potential of powerful processors. For scientific applications, several loop transformations have been shown to be useful in improving both temporal and spatial locality. Recently, there has been some work in the area of data layout optimizations, i.e., changing the memory layouts of multidimensional arrays from the language-defined default, such as column-major storage in Fortran. Such memory layout decisions affect the spatial locality characteristics of loop nests. While data layout transformations are not constrained by data dependences, they have no effect on temporal locality. Loop transformations, on the other hand, are not readily applicable to imperfect loop nests and are constrained by data dependences. More importantly, loop transformations affect the memory access patterns of all the arrays accessed in a loop nest and, as a result, the locality characteristics of some of the arrays may worsen. This paper presents a technique based on integer linear programming (ILP) that attempts to derive the best combination of loop and data layout transformations. Prior attempts to unify loop and data layout transformations for programs consisting of a sequence of loop nests have relied on heuristics, not only for the transformations applied to a single loop nest but also for the order in which the loop nests are considered. The ILP formulation presented here obviates the need for such heuristics and provides a baseline against which the heuristic algorithms can be compared. More importantly, our approach is able to transform memory layouts dynamically during program execution. This is particularly useful in applications whose disjoint code segments demand different layouts for a given array.
In addition, we show how this formulation can be extended to address the false sharing problem in a multiprocessor environment. The key data structure we introduce is the memory layout graph (MLG), which allows us to formulate these problems as path problems. The paper discusses how this MLG-based ILP approach relates to other work in the area, including our previous work. Experimental results on a MIPS R10000-based system demonstrate the benefits of this approach and show that the use of the ILP formulation does not increase the compilation time significantly.
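
The idea of choosing layouts dynamically across a sequence of loop nests can be illustrated with a small sketch. This is not the paper's MLG construction or its ILP formulation; it is a toy shortest-path search under assumed inputs. The function name `best_layout_sequence`, the per-nest cost tables, and the single flat `relayout_cost` are all hypothetical and chosen only for illustration.

```python
def best_layout_sequence(nest_costs, relayout_cost):
    """Pick one layout per loop nest, minimizing access cost plus relayout cost.

    nest_costs: list of dicts, one per loop nest, mapping a layout name to an
                assumed access cost of running that nest with that layout.
    relayout_cost: assumed cost charged whenever two consecutive nests use
                   different layouts for the array.
    Returns (total_cost, list_of_layouts).
    """
    layouts = list(nest_costs[0])
    # best[l] = (cheapest cost of any path ending in layout l, that path)
    best = {l: (nest_costs[0][l], [l]) for l in layouts}
    for costs in nest_costs[1:]:
        new_best = {}
        for l in layouts:
            candidates = []
            for prev, (cost_so_far, path) in best.items():
                # Staying in the same layout is free; switching pays relayout_cost.
                step = 0 if prev == l else relayout_cost
                candidates.append((cost_so_far + step + costs[l], path + [l]))
            new_best[l] = min(candidates, key=lambda t: t[0])
        best = new_best
    return min(best.values(), key=lambda t: t[0])


# Three nests that alternate in which layout they prefer (made-up costs).
nests = [
    {"row": 1, "col": 10},
    {"row": 10, "col": 1},
    {"row": 1, "col": 10},
]
cheap_switch = best_layout_sequence(nests, relayout_cost=2)
costly_switch = best_layout_sequence(nests, relayout_cost=100)
```

With a cheap relayout the search switches layouts between nests, while an expensive relayout makes a single static layout preferable; this is the trade-off that a dynamic layout scheme has to weigh.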
