Improving cache locality with blocked array layouts

Minimizing cache misses is one of the most important factors to reduce average latency for memory accesses. Tiled codes modify the instruction stream to exploit cache locality for array accesses. Here, we further reduce cache misses, restructuring the memory layout of multidimensional arrays, that are accessed by tiled instruction code. In our method, array elements are stored in a blocked way, exactly as they are swept by the tiled instruction stream. We present a straightforward way to easily translate multidimensional indexing of arrays into their blocked memory layout using simple binary-mask operations. Indices for such array layouts are easily calculated based on the algebra of dilated integers, similarly to morton-order indexing. Actual experimental results, using matrix multiplication and LU-decomposition on various size arrays, illustrate that execution time is greatly improved when combining tiled code with tiled array layouts and binary mask-based index translation functions. Simulations using the Simplescalar tool, verify that enhanced performance is due to the considerable reduction of total cache misses.

[1]  Chau-Wen Tseng,et al.  Improving data locality with loop transformations , 1996, TOPL.

[2]  David S. Wise Ahnentafel Indexing into Morton-Ordered Arrays, or Matrix Locality for Free , 2000, Euro-Par.

[3]  Wei Li,et al.  Unifying data and control transformations for distributed shared-memory machines , 1995, PLDI '95.

[4]  Chun-Yuan Lin,et al.  Efficient Representation Scheme for Multidimensional Array Operations , 2002, IEEE Trans. Computers.

[5]  Chau-Wen Tseng,et al.  Data transformations for eliminating conflict misses , 1998, PLDI.

[6]  David A. Patterson,et al.  Computer Architecture: A Quantitative Approach , 1969 .

[7]  Mithuna Thottethodi,et al.  Recursive array layouts and fast parallel matrix multiplication , 1999, SPAA '99.

[8]  Viktor K. Prasanna,et al.  Tiling, Block Data Layout, and Memory Hierarchy Performance , 2003, IEEE Trans. Parallel Distributed Syst..

[9]  Monica S. Lam,et al.  The cache performance and optimizations of blocked algorithms , 1991, ASPLOS IV.

[10]  Mithuna Thottethodi,et al.  Nonlinear array layouts for hierarchical memory systems , 1999, ICS '99.

[11]  Keshav Pingali,et al.  An experimental evaluation of tiling and shackling for memory hierarchy management , 1999, ICS '99.

[12]  Norbert Podhorszki,et al.  Proceedings of the 10th Euromicro conference on Parallel, distributed and network-based processing , 2002 .

[13]  Mahmut T. Kandemir,et al.  Improving Cache Locality by a Combination of Loop and Data Transformation , 1999, IEEE Trans. Computers.

[14]  Monica S. Lam,et al.  A data locality optimizing algorithm , 1991, PLDI '91.

[15]  M. Castells Multilevel tiling for non-rectangular interation spaces , 1999 .

[16]  Viktor K. Prasanna,et al.  Analysis of memory hierarchy performance of block data layout , 2002, Proceedings International Conference on Parallel Processing.

[17]  David A. Wood,et al.  Cache profiling and the SPEC benchmarks: a case study , 1994, Computer.

[18]  Chau-Wen Tseng,et al.  Locality Optimizations for Multi-Level Caches , 1999, ACM/IEEE SC 1999 Conference (SC'99).