Compiler optimizations for improving data locality

In the past decade, processor speeds have increased much faster than memory speeds. Small, fast cache memories are designed to bridge this gap, but they are effective only when programs exhibit data locality. In this paper, we present compiler optimizations that improve data locality based on a simple yet accurate cost model. The model computes both temporal and spatial reuse of cache lines to find desirable loop organizations. The cost model drives the application of compound transformations consisting of loop permutation, loop fusion, loop distribution, and loop reversal. We demonstrate that these program transformations are useful for optimizing many programs. To validate our optimization strategy, we implemented our algorithms and ran experiments on a large collection of scientific programs and kernels. Experiments with kernels illustrate that our model and algorithm can select and achieve the best performance. For over thirty complete applications, we executed the original and transformed versions and simulated cache hit rates. We collected statistics about the inherent characteristics of these programs and our ability to improve their data locality. To our knowledge, these studies are the first of such breadth and depth. We found that performance improvements were difficult to achieve because benchmark programs typically have high hit rates even for small data caches; however, our optimizations significantly improved several programs.
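To make the effect of loop permutation on spatial reuse concrete, the following is a minimal sketch and not the cost model described in the paper. It assumes a row-major n x n array and a tiny fully associative LRU cache, and it counts cache-line misses for the two possible loop orders. The names `simulate_misses` and `row_major_trace`, and the parameters `line_elems` and `num_lines`, are hypothetical and chosen only for illustration.

```python
from collections import OrderedDict


def simulate_misses(addresses, line_elems=8, num_lines=4):
    """Count misses in a small fully associative LRU cache.

    line_elems and num_lines are illustrative assumptions, not values
    taken from the paper.
    """
    cache = OrderedDict()  # keys are cache-line numbers, ordered by recency
    misses = 0
    for addr in addresses:
        line = addr // line_elems
        if line in cache:
            cache.move_to_end(line)        # hit: mark as most recently used
        else:
            misses += 1
            cache[line] = True
            if len(cache) > num_lines:
                cache.popitem(last=False)  # evict least recently used line
    return misses


def row_major_trace(n, inner_is_column):
    """Element addresses touched when visiting every a[i][j] of a row-major array."""
    if inner_is_column:
        # for i: for j: a[i][j]  (inner loop has unit stride)
        return [i * n + j for i in range(n) for j in range(n)]
    # for j: for i: a[i][j]      (inner loop has stride n)
    return [i * n + j for j in range(n) for i in range(n)]


n = 64
misses_ij = simulate_misses(row_major_trace(n, inner_is_column=True))
misses_ji = simulate_misses(row_major_trace(n, inner_is_column=False))
print(misses_ij, misses_ji)
```

Under these assumed parameters, the unit-stride order misses once per cache line (512 misses), while the permuted order misses on every access (4096 misses). A compiler cost model of the kind the abstract describes would aim to predict this difference statically, without simulating the accesses.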
