Improving locality using loop and data transformations in an integrated framework

This paper presents a new integrated compiler framework for improving the cache performance of scientific applications. In addition to applying loop transformations, the method includes data layout optimizations, i.e., those that change the memory layouts of data structures (arrays in this case). A key characteristic of this approach is that loop transformations are used to improve temporal locality while data layout optimizations are used to improve spatial locality. This optimization framework was used with sixteen loop nests from several benchmarks and math libraries, and the performance was measured using a cache simulator in addition to using a single node of the SGI Origin 2000 distributed-shared-memory machine for measuring actual execution times. The results demonstrate that this approach is very effective in improving locality and outperforms current solutions that use either loop or data transformations alone. We expect that our solution will also enable better register usage due to increased temporal locality in the innermost loop, and that it will help in eliminating false-sharing on multiprocessors due to exploiting spatial locality in the innermost loop.

[1]  P. Sadayappan,et al.  Communication-Free Hyperplane Partitioning of Nested Loops , 1993, J. Parallel Distributed Comput..

[2]  John Zahorjan,et al.  Optimizing Data Locality by Array Restructuring , 1995 .

[3]  François Irigoin,et al.  Supernode partitioning , 1988, POPL '88.

[4]  Wei Li,et al.  Compiling for NUMA Parallel Machines , 1993 .

[5]  Monica S. Lam,et al.  A data locality optimizing algorithm , 1991, PLDI '91.

[6]  Mahmut T. Kandemir,et al.  A hyperplane based approach for optimizing spatial locality in loop nests , 1998, ICS '98.

[7]  Mahmut T. Kandemir,et al.  Compiler algorithms for optimizing locality and parallelism on shared and distributed memory machines , 1997, Proceedings 1997 International Conference on Parallel Architectures and Compilation Techniques.

[8]  Alan Jay Smith,et al.  Evaluating Associativity in CPU Caches , 1989, IEEE Trans. Computers.

[9]  Michael Wolfe,et al.  High performance compilers for parallel computing , 1995 .

[10]  William Jalby,et al.  A strategy for array management in local memory , 1994, Math. Program..

[11]  Monica S. Lam,et al.  The cache performance and optimizations of blocked algorithms , 1991, ASPLOS IV.

[12]  Alexander Schrijver,et al.  Theory of linear and integer programming , 1986, Wiley-Interscience series in discrete mathematics and optimization.

[13]  Keshav Pingali,et al.  Data-centric multi-level blocking , 1997, PLDI '97.

[14]  Kathryn S. McKinley,et al.  Tile size selection using cache organization and data layout , 1995, PLDI '95.

[15]  Monica S. Lam,et al.  Data and computation transformations for multiprocessors , 1995, PPOPP '95.

[16]  Wei Li,et al.  Unifying data and control transformations for distributed shared-memory machines , 1995, PLDI '95.

[17]  Michael F. P. O'Boyle,et al.  Non-singular data transformations: definition, validity and applications , 1997, ICS '97.

[18]  Chau-Wen Tseng,et al.  Compiler optimizations for improving data locality , 1994, ASPLOS VI.