A Quantitative Analysis of Tile Size Selection Algorithms

Loop tiling is an effective optimizing transformation to boost the memory performance of a program, especially for dense matrix scientific computations. The magnitude and stability of the achieved performance improvements are heavily dependent on the appropriate selection of tile sizes. Many existing tile selection algorithms try to find tile sizes which eliminate self-interference cache conflict misses, maximize cache utilization, and minimize cross-interference cache conflict misses. These techniques depend heavily on the actual layout of the arrays in memory. Array padding, an effective data layout optimization technique, is therefore incorporated by many algorithms to help loop tiling stabilize its effectiveness by avoiding “pathological” array sizes.In this paper, we examine several such combined algorithms in terms of cost-benefit trade-offs, and introduce a new algorithm. The preliminary experimental results show that more precise and costly tile selection and array padding algorithms may not be justified by the resulting performance improvements since such improvements may also be achieved by much simpler and therefore less expensive strategies. The key issues in finding a good tiling algorithm are (1) to identify critical performance factors and (2) to develop corresponding performance models that allow predictions at a sufficient level of accuracy. Following this insight, we have developed a new tiling algorithm that performs better than previous algorithms in terms of execution time and stability, and generates code with a performance comparable to the best measured algorithm. Experimental results on two standard benchmark kernels for matrix multiply and LU factorization show that the new algorithm is orders of magnitude faster than the best previous algorithm without sacrificing stability and execution speed of the generated code.

[1]  Vivek Sarkar,et al.  On Estimating and Enhancing Cache Effectiveness , 1991, LCPC.

[2]  Chau-Wen Tseng,et al.  Improving data locality with loop transformations , 1996, TOPL.

[3]  Vivek Sarkar,et al.  A compiler framework for restructuring data declarations to enhance cache and TLB effectiveness , 1994, CASCON.

[4]  Sharad Malik,et al.  Cache miss equations: an analytical representation of cache misses , 1997, ICS '97.

[5]  Frank Kuehndel Software Methods for Avoiding Cache Conflicts , 1998 .

[6]  David H. Bailey Unfavorable Strides in Cache Memory Systems (RNR Technical Report RNR-92-015) , 1995, Sci. Program..

[7]  Chau-Wen Tseng,et al.  Locality Optimizations for Multi-Level Caches , 1999, ACM/IEEE SC 1999 Conference (SC'99).

[8]  Michael Wolfe,et al.  Iteration Space Tiling for Memory Hierarchies , 1987, PPSC.

[9]  Graham R. Nudd,et al.  Predicting the Cache Miss Ratio of Loop-Nested Array References , 1997 .

[10]  William Pugh,et al.  Counting solutions to Presburger formulas: how and why , 1994, PLDI '94.

[11]  Sharad Malik,et al.  Cache miss equations: a compiler framework for analyzing and tuning memory behavior , 1999, TOPL.

[12]  Graham R. Nudd,et al.  Analytical Modeling of Set-Associative Cache Behavior , 1999, IEEE Trans. Computers.

[13]  David A. Patterson,et al.  Computer Architecture: A Quantitative Approach , 1969 .

[14]  Michael Wolfe,et al.  High performance compilers for parallel computing , 1995 .

[15]  Ulrich Kremer,et al.  Tile Selection Algorithms and their Performance Models , 1999 .

[16]  Zhiyuan Li,et al.  New tiling techniques to improve cache temporal locality , 1999, PLDI '99.

[17]  Tarek S. Abdelrahman,et al.  Array Data Layout for the Reduction of Cache Conflicts , 2001 .

[18]  David A. Patterson,et al.  Computer Architecture - A Quantitative Approach, 5th Edition , 1996 .

[19]  Chau-Wen Tseng,et al.  Eliminating conflict misses for high performance architectures , 1998, ICS '98.

[20]  Keshav Pingali,et al.  An experimental evaluation of tiling and shackling for memory hierarchy management , 1999, ICS '99.

[21]  Allan Porterfield,et al.  Data cache performance of supercomputer applications , 1990, Proceedings SUPERCOMPUTING '90.

[22]  Monica S. Lam,et al.  The cache performance and optimizations of blocked algorithms , 1991, ASPLOS IV.

[23]  Michael E. Wolf,et al.  Combining Loop Transformations Considering Caches and Scheduling , 2004, International Journal of Parallel Programming.

[24]  Steve Carr,et al.  Unroll-and-jam using uniformly generated sets , 1997, Proceedings of 30th Annual International Symposium on Microarchitecture.

[25]  Chau-Wen Tseng,et al.  Data transformations for eliminating conflict misses , 1998, PLDI.

[26]  William Jalby,et al.  A Quantitative Algorithm for Data Locality Optimization , 1991, Code Generation.

[27]  Karim Esseghir Improving data locality for caches , 1993 .

[28]  Philippe Clauss Counting Solutions to Linear and Nonlinear Constraints Through Ehrhart Polynomials: Applications to Analyze and Transform Scientific Programs , 1996, International Conference on Supercomputing.

[29]  Kathryn S. McKinley,et al.  Tile size selection using cache organization and data layout , 1995, PLDI '95.

[30]  Larry Carter,et al.  Quantifying the Multi-Level Nature of Tiling Interactions , 1997, International Journal of Parallel Programming.

[31]  David A. Patterson,et al.  Computer architecture (2nd ed.): a quantitative approach , 1996 .

[32]  Chau-Wen Tseng,et al.  A Comparison of Compiler Tiling Algorithms , 1999, CC.

[33]  Olivier Temam,et al.  A quantitative analysis of loop nest locality , 1996, ASPLOS VII.

[34]  David H. Bailey Unfavorable strides in cache memory systems , 1992 .

[35]  Olivier Temam,et al.  Cache interference phenomena , 1994, SIGMETRICS.

[36]  Jacqueline Chame,et al.  A tile selection algorithm for data locality and cache interference , 1999, ICS '99.

[37]  Alan Jay Smith,et al.  Evaluating Associativity in CPU Caches , 1989, IEEE Trans. Computers.

[38]  Hiroshi Nakamura,et al.  Augmenting Loop Tiling with Data Alignment for Improved Cache Performance , 1999, IEEE Trans. Computers.

[39]  Monica S. Lam,et al.  A data locality optimizing algorithm , 1991, PLDI '91.