Efficient and accurate analytical modeling of whole-program data cache behavior

Data caches are a key hardware mechanism for bridging the gap between processor and memory speeds, but they help only programs that exhibit sufficient data locality in their memory accesses. A method for evaluating cache performance is therefore required, both to quantify cache misses and to guide data cache optimizations. Existing analytical models for data cache optimizations mainly target isolated perfect loop nests. We present an analytical model that can statically analyze not only loop nest fragments but also complete numerical programs with regular, compile-time predictable memory accesses. Central to the whole-program approach are abstract call inlining, memory access vectors, and parametric reuse analysis, which allow the reuse and interference both within and across loop nests to be quantified precisely in a unified framework. Within this framework, the cache misses of a program are specified using mathematical formulas, and the miss ratio is predicted from these formulas using statistical sampling techniques. Our experimental results on kernels and whole programs indicate accurate cache miss estimates in substantially less time than simulation, typically several orders of magnitude less.
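To illustrate the sampling idea in isolation, the following minimal sketch estimates a miss ratio by testing a random subset of iteration points against a closed-form miss condition instead of enumerating all of them. It is not the model described above: the single streaming reference `a[i]`, the cold-cache miss test `is_miss`, the cache parameters, and all names are assumptions chosen only to keep the example small.

```python
import math
import random

# Assumed parameters for illustration only (not taken from the paper).
LINE_BYTES = 32   # cache line size
ELEM_BYTES = 8    # size of one array element


def is_miss(i, base=0):
    """Hypothetical closed-form miss test for one streaming reference a[i].

    Assuming a cold cache and no interference, access i misses exactly
    when it is the first access to touch its cache line.
    """
    if i == 0:
        return True
    addr = base + i * ELEM_BYTES
    return addr // LINE_BYTES != (addr - ELEM_BYTES) // LINE_BYTES


def estimate_miss_ratio(n_iters, n_samples, seed=0, z=1.96):
    """Estimate the miss ratio from uniformly sampled iteration points.

    Returns the sample proportion and the half-width of a normal
    approximation confidence interval (z = 1.96 for about 95%).
    """
    rng = random.Random(seed)
    misses = sum(is_miss(rng.randrange(n_iters)) for _ in range(n_samples))
    p = misses / n_samples
    half_width = z * math.sqrt(p * (1 - p) / n_samples)
    return p, half_width


if __name__ == "__main__":
    p, hw = estimate_miss_ratio(n_iters=1_000_000, n_samples=2000)
    print(f"estimated miss ratio: {p:.3f} +/- {hw:.3f}")
```

Under these assumptions one access in every four starts a new line, so the exact miss ratio is 0.25. The cost of the estimate depends on the number of samples rather than on the size of the iteration space, which is the property that makes sampling attractive compared with evaluating every access.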
