Intermediately executed code is the key to find refactorings that improve temporal data locality

The growing speed gap between memory and processor makes efficient use of the cache ever more important for reaching high performance. One of the most important ways to improve cache behavior is to increase data locality. While many cache analysis tools have been developed, most of them only indicate the locations in the code where cache misses occur. Even after the cache bottlenecks have been pinpointed in the source code, optimizing the program with these tools often remains hard. In this paper, we present two related tools that not only pinpoint the locations of cache misses, but also suggest source code refactorings which improve temporal locality and thereby eliminate the majority of the cache misses. In both tools, the key to finding the appropriate refactorings is an analysis of the code executed between a data use and the next use of the same data, which we call the Intermediately Executed Code (IEC). The first tool, the Reuse Distance VISualizer (RDVIS), clusters the IECs, which reduces the work needed to find the required refactorings. The second tool, SLO (short for "Suggestions for Locality Optimizations"), suggests a number of refactorings by analyzing the call graph and loop structure of the IEC. Using these tools, we have pinpointed the most important optimizations for a number of SPEC2000 programs, resulting in an average speedup of 2.3 on a number of different platforms.
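To make the notion of Intermediately Executed Code concrete, the following is a minimal sketch, not the RDVIS or SLO implementation. It assumes a hypothetical trace format of `(code_location, address)` pairs in execution order, takes the reuse distance to be the number of distinct addresses touched between a use and the next use of the same address, and groups long-distance reuses by identical IEC as a simplified stand-in for the clustering step. The function names `reuse_pairs` and `group_by_iec` and the `min_distance` threshold are illustrative assumptions.

```python
from collections import defaultdict


def reuse_pairs(trace):
    """Return (use_idx, reuse_idx, reuse_distance, iec) for each reuse.

    trace: list of (code_location, address) tuples in execution order
    (a hypothetical format chosen for this sketch).
    """
    last_use = {}  # address -> index of its most recent access
    results = []
    for i, (_, addr) in enumerate(trace):
        if addr in last_use:
            j = last_use[addr]
            between = trace[j + 1:i]
            # Reuse distance: distinct addresses touched between the two uses.
            distance = len({a for _, a in between})
            # IEC: code locations executed between the two uses.
            iec = frozenset(loc for loc, _ in between)
            results.append((j, i, distance, iec))
        last_use[addr] = i
    return results


def group_by_iec(pairs, min_distance):
    """Group reuses whose distance is at least min_distance by identical IEC."""
    groups = defaultdict(list)
    for j, i, distance, iec in pairs:
        if distance >= min_distance:
            groups[iec].append((j, i, distance))
    return dict(groups)


# Toy trace: "a" and "b" are used in loop1 and reused only in loop3.
trace = [
    ("loop1", "a"), ("loop1", "b"),
    ("loop2", "c"), ("loop2", "d"),
    ("loop3", "a"), ("loop3", "b"),
]
pairs = reuse_pairs(trace)
groups = group_by_iec(pairs, min_distance=3)
```

In this toy trace, the reuse of `a` has `loop1` and `loop2` in its IEC, which hints that bringing the use in `loop1` and the reuse in `loop3` closer together (for example by fusing or reordering loops) would shorten the reuse distance. The slicing makes this sketch quadratic in the trace length, so it is only suitable for illustration.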
