Locality Optimizations for Multi-Level Caches

Compiler transformations can significantly improve data locality of scientific programs. In this paper, we examine the impact of multi-level caches on data locality optimizations. We find nearly all the benefits can be achieved by simply targeting the L1 (primary) cache. Most locality transformations are unaffected because they improve reuse for all levels of the cache; however, some optimizations can be enhanced. Inter-variable padding can take advantage of modular arithmetic to eliminate conflict misses and preserve group reuse on multiple cache levels. Loop fusion can balance increasing group reuse for the L2 (secondary) cache at the expense of losing group reuse at the smaller L1 cache. Tiling for the L1 cache also exploits locality available in the L2 cache. Experiments show enhanced algorithms are able to reduce cache misses, but performance improvements are rarely significant. Our results indicate existing compiler optimizations are usually sufficient to achieve good performance for multi-level caches.

[1]  Vivek Sarkar,et al.  Automatic selection of high-order transformations in the IBM XL FORTRAN compilers , 1997, IBM J. Res. Dev..

[2]  Chau-Wen Tseng,et al.  Data transformations for eliminating conflict misses , 1998, PLDI.

[3]  Tarek S. Abdelrahman,et al.  Fusion of Loops for Parallelism and Locality , 1997, IEEE Trans. Parallel Distributed Syst..

[4]  Vivek Sarkar,et al.  On Estimating and Enhancing Cache Effectiveness , 1991, LCPC.

[5]  Ken Kennedy,et al.  Compiler blockability of numerical algorithms , 1992, Proceedings Supercomputing '92.

[6]  V. Sarkar,et al.  Collective Loop Fusion for Array Contraction , 1992, LCPC.

[7]  Sharad Malik,et al.  Cache miss equations: an analytical representation of cache misses , 1997, ICS '97.

[8]  Kathryn S. McKinley,et al.  Tile size selection using cache organization and data layout , 1995, PLDI '95.

[9]  Chau-Wen Tseng,et al.  Improving data locality with loop transformations , 1996, TOPL.

[10]  Vivek Sarkar,et al.  A compiler framework for restructuring data declarations to enhance cache and TLB effectiveness , 1994, CASCON.

[11]  Michael Wolfe,et al.  More iteration space tiling , 1989, Proceedings of the 1989 ACM/IEEE Conference on Supercomputing (Supercomputing '89).

[12]  R. C. Whaley,et al.  Automatically Tuned Linear Algebra Software (ATLAS) , 2011, Encyclopedia of Parallel Computing.

[13]  Monica S. Lam,et al.  A Loop Transformation Theory and an Algorithm to Maximize Parallelism , 1991, IEEE Trans. Parallel Distributed Syst..

[14]  Mithuna Thottethodi,et al.  Nonlinear array layouts for hierarchical memory systems , 1999, ICS '99.

[15]  Chau-Wen Tseng,et al.  Eliminating conflict misses for high performance architectures , 1998, ICS '98.

[16]  Keshav Pingali,et al.  An experimental evaluation of tiling and shackling for memory hierarchy management , 1999, ICS '99.

[17]  Wei Li,et al.  Unifying data and control transformations for distributed shared-memory machines , 1995, PLDI '95.

[18]  Jacqueline Chame,et al.  A tile selection algorithm for data locality and cache interference , 1999, ICS '99.

[19]  Steven W. K. Tjiang,et al.  SUIF: an infrastructure for research on parallelizing and optimizing compilers , 1994, SIGP.

[20]  Chau-Wen Tseng,et al.  A Comparison of Compiler Tiling Algorithms , 1999, CC.

[21]  Mahmut T. Kandemir,et al.  Improving locality using loop and data transformations in an integrated framework , 1998, Proceedings. 31st Annual ACM/IEEE International Symposium on Microarchitecture.

[22]  Ken Kennedy,et al.  Maximizing Loop Parallelism and Improving Data Locality via Loop Fusion and Distribution , 1993, LCPC.

[23]  Zhiyuan Li,et al.  New tiling techniques to improve cache temporal locality , 1999, PLDI '99.

[24]  Olivier Temam,et al.  Cache interference phenomena , 1994, SIGMETRICS.

[25]  Kathryn S. McKinley,et al.  A Parametrized Loop Fusion Algorithm for Improving Parallelism and Cache Locality , 1997, Comput. J..

[26]  Monica S. Lam,et al.  A data locality optimizing algorithm , 1991, PLDI '91.

[27]  Monica S. Lam,et al.  The cache performance and optimizations of blocked algorithms , 1991, ASPLOS IV.

[28]  Mahmut T. Kandemir,et al.  A compiler algorithm for optimizing locality in loop nests , 1997, ICS '97.

[29]  François Irigoin,et al.  Supernode partitioning , 1988, POPL '88.