Improving Cache Performance Through Tiling and Data Alignment

We address the problem of improving the data cache performance of numerical applications, specifically those with blocked (or tiled) loops. We present DAT, a data alignment technique that uses array padding to improve program performance by minimizing cache conflict misses. We describe algorithms for selecting tile sizes that maximize data cache utilization and for computing pad sizes that eliminate self-interference conflicts in the chosen tile. We also present a generalization of the technique to handle applications with several tiled arrays. Our experimental results, comparing our technique with previously published approaches on machines with different cache configurations, show consistently good performance on several benchmark programs for a variety of problem sizes.
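To make the notion of self-interference concrete, the following is a minimal sketch and not the DAT algorithm itself. It assumes a direct-mapped cache, a row-major array, and a tile anchored at element offset 0; the function names, the brute-force search, and the `max_pad` bound are all illustrative assumptions. It checks whether two distinct memory lines of a tile map to the same cache set, then searches for the smallest row padding that removes such conflicts.

```python
def tile_has_self_conflict(ld, rows, cols, cache_lines, line_elems):
    """Return True if two distinct memory lines touched by a rows x cols tile
    map to the same set of a direct-mapped cache.

    Assumptions (illustrative only): row-major array with leading dimension
    `ld` (in elements), tile starting at element offset 0, cache with
    `cache_lines` sets of `line_elems` elements each.
    """
    owner = {}  # cache set -> memory line currently mapped there
    for i in range(rows):
        for j in range(cols):
            mem_line = (i * ld + j) // line_elems
            cache_set = mem_line % cache_lines
            # setdefault records the first memory line seen for this set;
            # a different memory line in the same set is a self-conflict.
            if owner.setdefault(cache_set, mem_line) != mem_line:
                return True
    return False


def smallest_pad(n, rows, cols, cache_lines, line_elems, max_pad=64):
    """Brute-force search for the smallest number of padding elements per row
    that makes the tile conflict-free; returns None if none is found."""
    for pad in range(max_pad + 1):
        if not tile_has_self_conflict(n + pad, rows, cols, cache_lines, line_elems):
            return pad
    return None


if __name__ == "__main__":
    # Hypothetical configuration: 256 sets of 4 elements, rows of 1024 elements.
    print(tile_has_self_conflict(1024, 8, 8, 256, 4))
    print(smallest_pad(1024, 8, 8, 256, 4))
```

In this hypothetical configuration the row length equals the cache capacity in elements, so every row of the unpadded tile lands on the same cache sets; padding each row shifts successive rows onto different sets. A tile that holds more lines than the cache has sets cannot be made conflict-free by padding, which is why tile size selection and pad computation are treated together.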
