Improving Cache Locality by a Combination of Loop and Data Transformations

Exploiting locality of reference is key to realizing high levels of performance on modern processors. This paper describes a compiler algorithm for optimizing cache locality in scientific codes on uniprocessor and multiprocessor machines. A distinctive characteristic of our algorithm is that it considers loop and data layout transformations in a unified framework. Our approach is very effective at reducing cache misses and can optimize some nests for which optimization techniques based on loop transformations alone are not successful. An important special case is one in which data layouts of some arrays are fixed and cannot be changed. We show how our algorithm can accommodate this case and demonstrate how it can be used to optimize multiple loop nests. Experiments on several benchmarks show that the techniques presented in this paper result in substantial improvement in cache performance.
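The claim that some nests cannot be fixed by loop transformations alone can be illustrated with a small sketch. It is not the paper's algorithm: the nest `A[i][j] += B[j][i]`, the helper names `address` and `line_switches`, and the metric (how often consecutive accesses to an array move to a different cache line) are all assumptions chosen for illustration, and the metric is only a rough proxy for spatial locality, not a cache simulation.

```python
# Illustrative sketch only: all names and the proxy metric are hypothetical,
# not taken from the paper.

def address(i, j, n, layout):
    """Linear element offset of X[i][j] in an n x n array.

    layout is "row" (row-major) or "col" (column-major).
    """
    return i * n + j if layout == "row" else j * n + i


def line_switches(n, line_elems, order, layout_a, layout_b):
    """Count cache-line changes for the nest body A[i][j] += B[j][i].

    order is "ij" (i outermost) or "ji" (loops interchanged).
    Each array is tracked separately; a switch is counted whenever an
    access falls in a different line than the previous access to that array.
    """
    switches = 0
    last_a = last_b = None
    for outer in range(n):
        for inner in range(n):
            i, j = (outer, inner) if order == "ij" else (inner, outer)
            line_a = address(i, j, n, layout_a) // line_elems
            line_b = address(j, i, n, layout_b) // line_elems
            switches += (line_a != last_a) + (line_b != last_b)
            last_a, last_b = line_a, line_b
    return switches


if __name__ == "__main__":
    n, line_elems = 64, 8
    original = line_switches(n, line_elems, "ij", "row", "row")
    interchanged = line_switches(n, line_elems, "ji", "row", "row")
    relaid = line_switches(n, line_elems, "ij", "row", "col")
    print("original nest, both row-major:  ", original)
    print("loops interchanged, row-major:  ", interchanged)
    print("original order, B column-major: ", relaid)
```

In this toy nest, interchanging the loops only moves the strided accesses from `B` to `A`, so the count does not improve. Storing `B` column-major while keeping the loop order gives both arrays unit-stride access, which is the kind of case where combining layout changes with loop transformations helps.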
