Towards a static cache analysis for whole program analysis

Data caches have become very popular to overcome the gap between main memories and processors performance. Caches work well for programs with sufficient locality. Unfortunately, there are many programs that do not take advantage of them, thereby suffering large number of misses. Small modifications in the source code may change memory patterns, thereby altering the cache behaviour. If the code is modified properly, we can expect a high cache performance improvement. A detailed information about the number of misses and their causes is necessary to devise effective transformations. However, this information is very hard to obtain. This thesis describes a new method for describing the cache behaviour of whole programs. Based on a new characterisation of data reuse across multiple loop nests, we present a method, a prototyping implementation and some experimental results for analysing the cache behaviour of whole programs with regular computations. Validation against cache simulation using real codes shows the efficiency and accuracy of our method. Our method can be used to guide compiler locality optimisations and improve cache simulation performance.

[1]  Siddhartha Chatterjee,et al.  Exact analysis of the cache behavior of nested loops , 2001, PLDI '01.

[2]  Vivek Sarkar,et al.  On Estimating and Enhancing Cache Effectiveness , 1991, LCPC.

[3]  Ken Kennedy,et al.  Compiler blockability of numerical algorithms , 1992, Proceedings Supercomputing '92.

[4]  Dionisios N. Pnevmatikatos,et al.  Cache performance of the SPEC92 benchmark suite , 1993, IEEE Micro.

[5]  Michael Wolfe,et al.  High performance compilers for parallel computing , 1995 .

[6]  James R. Larus,et al.  Exploiting hardware performance counters with flow and context sensitive profiling , 1997, PLDI '97.

[7]  William Jalby,et al.  XOR-Schemes: A Flexible Data Organization in Parallel Memories , 1985, ICPP.

[8]  José González,et al.  The design and performance of a conflict-avoiding cache , 1997, Proceedings of 30th Annual International Symposium on Microarchitecture.

[9]  John L. Hennessy,et al.  The accuracy of trace-driven simulations of multiprocessors , 1993, SIGMETRICS '93.

[10]  Sharad Malik,et al.  Cache miss equations: an analytical representation of cache misses , 1997, ICS '97.

[11]  Elana D. Granston,et al.  A Cache Visualization Tool , 1997, Computer.

[12]  Olivier Temam,et al.  A quantitative analysis of loop nest locality , 1996, ASPLOS VII.

[13]  Keshav Pingali,et al.  Data-centric multi-level blocking , 1997, PLDI '97.

[14]  Ken Kennedy,et al.  Analyzing and visualizing performance of memory hierarchies , 1990 .

[15]  Jingling Xue,et al.  Loop Tiling for Parallelism , 2000, Kluwer International Series in Engineering and Computer Science.

[16]  Reinhard Wilhelm,et al.  Efficient and Precise Cache Behavior Prediction for Real-Time Systems , 1999, Real-Time Systems.

[17]  David T. Harper,et al.  Vector Access Performance in Parallel Memories Using a Skewed Storage Scheme , 1987, IEEE Transactions on Computers.

[18]  David B. Whalley,et al.  Bounding worst-case instruction cache performance , 1994, 1994 Proceedings Real-Time Systems Symposium.

[19]  Walter L. Smith Probability and Statistics , 1959, Nature.

[20]  John L. Hennessy,et al.  Performance debugging shared memory multiprocessor programs with MTOOL , 1991, Proceedings of the 1991 ACM/IEEE Conference on Supercomputing (Supercomputing '91).

[21]  Zhiyuan Li,et al.  New tiling techniques to improve cache temporal locality , 1999, PLDI '99.

[22]  David E. Goldberg,et al.  Genetic Algorithms in Search Optimization and Machine Learning , 1988 .

[23]  Duncan H. Lawrie,et al.  The Prime Memory System for Array Access , 1982, IEEE Transactions on Computers.

[24]  Monica S. Lam,et al.  A data locality optimizing algorithm , 1991, PLDI '91.

[25]  Sharad Malik,et al.  Efficient microarchitecture modeling and path analysis for real-time software , 1995, Proceedings 16th IEEE Real-Time Systems Symposium.

[26]  Olivier Temam,et al.  Cache interference phenomena , 1994, SIGMETRICS.

[27]  Sally A. McKee,et al.  Caches As Filters: A Unifying Model for Memory Hierarchy Analysis , 2000 .

[28]  Josep Llosa,et al.  Near-Optimal Padding for Removing Conflict Misses , 2002, LCPC.

[29]  Trevor N. Mudge,et al.  Trace-driven memory simulation: a survey , 1997, CSUR.

[30]  Sharad Malik,et al.  Cache miss equations: a compiler framework for analyzing and tuning memory behavior , 1999, TOPL.

[31]  Constantine D. Polychronopoulos,et al.  Symbolic Analysis: A Basis for Parallelization, Optimization, and Scheduling of Programs , 1993, LCPC.

[32]  Josep Llosa,et al.  Near-optimal loop tiling by means of cache miss equations and genetic algorithms , 2002, Proceedings. International Conference on Parallel Processing Workshop.

[33]  Robert C. Bedichek Talisman: fast and accurate multicomputer simulation , 1995, SIGMETRICS '95/PERFORMANCE '95.

[34]  David B. Whalley,et al.  Integrating the timing analysis of pipelining and instruction caching , 1995, Proceedings 16th IEEE Real-Time Systems Symposium.

[35]  Mendel Rosenblum,et al.  Embra: fast and flexible machine simulation , 1996, SIGMETRICS '96.

[36]  W. Jalby,et al.  To copy or not to copy: a compile-time technique for assessing when data copying should be used to eliminate cache conflicts , 1993, Supercomputing '93.

[37]  B. Ramakrishna Rau,et al.  Pseudo-randomly interleaved memory , 1991, ISCA '91.

[38]  Jingling Xue,et al.  Let's study whole-program cache behaviour analytically , 2002, Proceedings Eighth International Symposium on High Performance Computer Architecture.

[39]  Margaret Martonosi,et al.  MemSpy: analyzing memory system bottlenecks in programs , 1992, SIGMETRICS '92/PERFORMANCE '92.

[40]  Josep Llosa,et al.  Optimizing cache miss equations polyhedra , 2000, CARN.

[41]  William Pugh,et al.  Counting solutions to Presburger formulas: how and why , 1994, PLDI '94.

[42]  Olivier Temam,et al.  To copy or not to copy: A compile-time technique for assessing when data copying should be used to eliminate cache conflicts , 1993, Supercomputing '93. Proceedings.

[43]  Monica S. Lam,et al.  The cache performance and optimizations of blocked algorithms , 1991, ASPLOS IV.

[44]  Chau-Wen Tseng,et al.  Compiler optimizations for improving data locality , 1994, ASPLOS VI.

[45]  Ken Kennedy,et al.  Maximizing Loop Parallelism and Improving Data Locality via Loop Fusion and Distribution , 1993, LCPC.

[46]  Yuri Ermoliev,et al.  Numerical techniques for stochastic optimization , 1988 .

[47]  Philippe Clauss,et al.  Counting solutions to linear and nonlinear constraints through Ehrhart polynomials: applications to analyze and transform scientific programs , 1996 .

[48]  Sang Lyul Min,et al.  An accurate worst case timing analysis technique for RISC processors , 1994, 1994 Proceedings Real-Time Systems Symposium.

[49]  Sang Lyul Min,et al.  Efficient worst case timing analysis of data caching , 1996, Proceedings Real-Time Technology and Applications.

[50]  Chau-Wen Tseng,et al.  Eliminating conflict misses for high performance architectures , 1998, ICS '98.

[51]  Ken Kennedy,et al.  Software methods for improvement of cache performance on supercomputer applications , 1989 .

[52]  Mahmut T. Kandemir,et al.  Improving locality using loop and data transformations in an integrated framework , 1998, Proceedings. 31st Annual ACM/IEEE International Symposium on Microarchitecture.

[53]  Jingling Xue,et al.  Unimodular Transformations of Non-Perfectly Nested Loops , 1997, Parallel Comput..

[54]  Peter S. Magnusson A Design for Efficient Simulation of a Multiprocessor , 1993, MASCOTS.

[55]  Michael E. Wolf,et al.  Improving locality and parallelism in nested loops , 1992 .

[56]  William Pugh,et al.  The Omega test: A fast and practical integer programming algorithm for dependence analysis , 1991, Proceedings of the 1991 ACM/IEEE Conference on Supercomputing (Supercomputing '91).

[57]  Walid Abu-Sufah,et al.  Improving the performance of virtual memory computers. , 1979 .

[58]  Emilio L. Zapata,et al.  Automatic analytical modeling for the estimation of cache misses , 1999, 1999 International Conference on Parallel Architectures and Compilation Techniques (Cat. No.PR00425).

[59]  Chau-Wen Tseng,et al.  Data transformations for eliminating conflict misses , 1998, PLDI.

[60]  Michael Wolfe,et al.  Advanced Loop Interchanging , 1986, ICPP.

[61]  Sharad Malik,et al.  Automated cache optimizations using CME driven diagnosis , 2000, ICS '00.

[62]  Josep Llosa,et al.  An efficient solver for Cache Miss Equations , 2000, 2000 IEEE International Symposium on Performance Analysis of Systems and Software. ISPASS (Cat. No.00EX422).

[63]  Reinhard Wilhelm,et al.  Cache Behavior Prediction by Abstract Interpretation , 1996, Sci. Comput. Program..

[64]  Sharad Malik,et al.  Precise miss analysis for program transformations with caches of arbitrary associativity , 1998, ASPLOS VIII.

[65]  Kathryn S. McKinley,et al.  Tile size selection using cache organization and data layout , 1995, PLDI '95.

[66]  Jingling Xue,et al.  Reuse-Driven Tiling for Data Locality , 1997, LCPC.

[67]  Norman P. Jouppi,et al.  Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers , 1990, [1990] Proceedings. The 17th Annual International Symposium on Computer Architecture.

[68]  François Irigoin,et al.  Supernode partitioning , 1988, POPL '88.

[69]  David B. Whalley,et al.  Timing analysis for data caches and set-associative caches , 1997, Proceedings Third IEEE Real-Time Technology and Applications Symposium.

[70]  Zbigniew Michalewicz,et al.  Genetic Algorithms + Data Structures = Evolution Programs , 1996, Springer Berlin Heidelberg.

[71]  David A. Patterson,et al.  Computer Architecture: A Quantitative Approach , 1969 .

[72]  Doran Wilde,et al.  A LIBRARY FOR DOING POLYHEDRAL OPERATIONS , 2000 .

[73]  David A. Wood,et al.  Cache profiling and the SPEC benchmarks: a case study , 1994, Computer.