An accurate cost model for guiding data locality transformations

Caches have become increasingly important with the widening gap between main memory and processor speeds. Small and fast cache memories are designed to bridge this discrepancy. However, they are only effective when programs exhibit sufficient data locality.The performance of the memory hierarchy can be improved by means of data and loop transformations. Tiling is a loop transformation that aims at reducing capacity misses by shortening the reuse distance. Padding is a data layout transformation targeted to reduce conflict misses.This article presents an accurate cost model that describes misses across different hierarchy levels and considers the effects of other hardware components such as branch predictors. The cost model drives the application of tiling and padding transformations. We combine the cost model with a genetic algorithm to compute the tile and pad factors that enhance the program performance.To validate our strategy, we ran experiments for a set of benchmarks on a large set of modern architectures. Our results show that this scheme is useful to optimize programs' performance. When compared to previous approaches, we observe that with a reasonable compile-time overhead, our approach gives significant performance improvements for all studied kernels on all architectures.

[1]  Josep Llosa,et al.  A fast and accurate framework to analyze and optimize cache memory behavior , 2004, TOPL.

[2]  Chau-Wen Tseng,et al.  Data transformations for eliminating conflict misses , 1998, PLDI.

[3]  David E. Goldberg,et al.  Genetic Algorithms in Search Optimization and Machine Learning , 1988 .

[4]  Chau-Wen Tseng,et al.  Improving data locality with loop transformations , 1996, TOPL.

[5]  Mithuna Thottethodi,et al.  Nonlinear array layouts for hierarchical memory systems , 1999, ICS '99.

[6]  Chau-Wen Tseng,et al.  Eliminating conflict misses for high performance architectures , 1998, ICS '98.

[7]  Chau-Wen Tseng,et al.  Locality Optimizations for Multi-Level Caches , 1999, ACM/IEEE SC 1999 Conference (SC'99).

[8]  S. McFarling Combining Branch Predictors , 1993 .

[9]  Olivier Temam,et al.  To copy or not to copy: A compile-time technique for assessing when data copying should be used to eliminate cache conflicts , 1993, Supercomputing '93. Proceedings.

[10]  Monica S. Lam,et al.  The cache performance and optimizations of blocked algorithms , 1991, ASPLOS IV.

[11]  Harsh Sharangpani,et al.  Itanium Processor Microarchitecture , 2000, IEEE Micro.

[12]  Ken Kennedy,et al.  Compiler blockability of numerical algorithms , 1992, Proceedings Supercomputing '92.

[13]  Martin E. Dyer,et al.  On the Complexity of Computing the Volume of a Polyhedron , 1988, SIAM J. Comput..

[14]  Leo Liberti,et al.  Introduction to Global Optimization , 2006 .

[15]  Josep Llosa,et al.  A Fast and Accurate Approach to Analyze Cache Memory Behavior (Research Note) , 2000, Euro-Par.

[16]  Panos M. Pardalos,et al.  Introduction to Global Optimization , 2000, Introduction to Global Optimization.

[17]  Yuri Ermoliev,et al.  Numerical techniques for stochastic optimization , 1988 .

[18]  Zbigniew Michalewicz,et al.  Genetic Algorithms Plus Data Structures Equals Evolution Programs , 1994 .

[19]  Pierre Hansen,et al.  Constrained Nonlinear 0-1 Programming , 1989 .

[20]  Dror Rawitz,et al.  The hardness of cache conscious data placement , 2002, POPL '02.

[21]  S. Vavasis Nonlinear optimization: complexity issues , 1991 .

[22]  Kathryn S. McKinley,et al.  Tile size selection using cache organization and data layout , 1995, PLDI '95.

[23]  Philippe Clauss,et al.  Counting solutions to linear and nonlinear constraints through Ehrhart polynomials: applications to analyze and transform scientific programs , 1996 .

[24]  Zbigniew Michalewicz,et al.  Genetic Algorithms + Data Structures = Evolution Programs , 1996, Springer Berlin Heidelberg.

[25]  W. Jalby,et al.  To copy or not to copy: a compile-time technique for assessing when data copying should be used to eliminate cache conflicts , 1993, Supercomputing '93.

[26]  Chau-Wen Tseng,et al.  A Comparison of Compiler Tiling Algorithms , 1999, CC.

[27]  Olivier Temam,et al.  A quantitative analysis of loop nest locality , 1996, ASPLOS VII.

[28]  C. D. Gelatt,et al.  Optimization by Simulated Annealing , 1983, Science.

[29]  Utpal Banerjee,et al.  Dependence analysis for supercomputing , 1988, The Kluwer international series in engineering and computer science.

[30]  Michael Shebanow,et al.  Single instruction stream parallelism is greater than two , 1991, ISCA '91.

[31]  John H. Holland,et al.  Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence , 1992 .

[32]  Sharad Malik,et al.  Cache miss equations: a compiler framework for analyzing and tuning memory behavior , 1999, TOPL.

[33]  Reiner Horst,et al.  Introduction to Global Optimization (Nonconvex Optimization and Its Applications) , 2002 .

[34]  Michael Wolfe,et al.  Advanced Loop Interchanging , 1986, ICPP.

[35]  Josep Llosa,et al.  An efficient solver for Cache Miss Equations , 2000, 2000 IEEE International Symposium on Performance Analysis of Systems and Software. ISPASS (Cat. No.00EX422).

[36]  Philip E. Gill,et al.  Practical optimization , 1981 .

[37]  Monica S. Lam,et al.  A data locality optimizing algorithm , 1991, PLDI '91.

[38]  Mahmut T. Kandemir,et al.  A Linear Algebra Framework for Automatic Determination of Optimal Data Layouts , 1999, IEEE Trans. Parallel Distributed Syst..

[39]  Chau-Wen Tseng,et al.  Compiler optimizations for improving data locality , 1994, ASPLOS VI.

[40]  Yuri Ermoliev,et al.  Stochastic programming, an introduction. Numerical techniques for stochastic optimization , 1988 .

[41]  Anoop Gupta,et al.  Design and evaluation of a compiler algorithm for prefetching , 1992, ASPLOS V.

[42]  Laszlo A. Belady,et al.  A Study of Replacement Algorithms for Virtual-Storage Computer , 1966, IBM Syst. J..

[43]  Aimo A. Törn,et al.  Global Optimization , 1999, Science.