Estimating cache misses and locality using stack distances

Cache behavior modeling is an important part of modern optimizing compilers. In this paper we present a method to estimate the number of cache misses at compile time, using a machine-independent model based on stack algorithms. Our algorithm computes stack histograms symbolically from data dependence distance vectors, and it is exact when dependence distances are uniformly generated. The stack histogram accurately models fully associative caches with an LRU replacement policy, and it provides a very good approximation for set-associative caches and for programs with non-constant dependence distances.

The stack histogram is an accurate, machine-independent metric of locality. Compilers using this metric can evaluate optimizations with respect to memory behavior. We illustrate this use of the stack histogram by comparing three locality-enhancing transformations: tiling, data shackling, and the product-space transformation. Additionally, the stack histogram model can be used to compute optimal parameters for data locality transformations, such as the tile size for loop tiling.
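To make the notion of a stack histogram concrete, the following is a minimal sketch that computes one from an explicit address trace and derives the miss count of a fully associative LRU cache from it. It is a trace-driven illustration only, not the symbolic compile-time algorithm described in the abstract. The function names (`stack_distances`, `stack_histogram`, `lru_misses`) and the convention that distance 1 denotes the most recently used block are assumptions made for this example.

```python
from collections import Counter

def stack_distances(trace):
    """Yield the LRU stack distance of each access (None for a first touch).

    Assumed convention: distance 1 means the block is the most recently
    used one, so an access hits in a cache of C blocks iff distance <= C.
    """
    stack = []  # LRU stack; the most recently used block is at the end
    for addr in trace:
        if addr in stack:
            pos = stack.index(addr)
            yield len(stack) - pos  # depth of addr counted from the top
            stack.pop(pos)
        else:
            yield None  # cold (compulsory) miss: block never seen before
        stack.append(addr)

def stack_histogram(trace):
    """Count how many accesses occur at each stack distance."""
    return Counter(stack_distances(trace))

def lru_misses(histogram, capacity):
    """Misses of a fully associative LRU cache holding `capacity` blocks."""
    return sum(count for dist, count in histogram.items()
               if dist is None or dist > capacity)

# Hypothetical trace: three blocks swept twice.
hist = stack_histogram(["a", "b", "c", "a", "b", "c"])
print(lru_misses(hist, 2), lru_misses(hist, 3))
```

Because the histogram does not depend on the cache size, a single pass over the trace is enough to read off the miss count for every capacity, which is what makes the metric machine-independent. The list-based stack here costs time linear in the stack depth per access and is chosen for clarity rather than speed.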
