Increasing Temporal Locality with Skewing and Recursive Blocking

We present a strategy, called recursive prismatic time skewing, that increases temporal reuse at all levels of the memory hierarchy, thus improving the performance of scientific codes that use iterative methods. Prismatic time skewing partitions the iteration space of multiple loops into skewed prisms with both spatial and temporal (or convergence) dimensions. Novel aspects of this work include multi-dimensional loop skewing; handling carried data dependences in the skewed loops without additional storage; bi-directional skewing to accommodate periodic boundary conditions; and an analysis and transformation strategy that works inter-procedurally. We combine prismatic skewing with a recursive blocking strategy to boost reuse at every level of the memory hierarchy. A preliminary evaluation of these techniques shows significant performance improvements compared both to the original codes and to methods described previously in the literature. With an inter-procedural application of our techniques, we were able to reduce total primary cache misses of a large application code by 27% and secondary cache misses by 119%.
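To make the basic idea concrete, the following is a minimal sketch of classic one-dimensional time skewing applied to an in-place three-point stencil. It is not the prismatic or recursive algorithm presented in this work. The function names `sweep_reference` and `sweep_time_skewed`, the stencil itself, and the `block` parameter are all assumptions chosen for illustration. The sketch skews the spatial index by the time step (`j = i + t`), which makes every dependence distance non-negative, and then blocks the skewed index so that each block is carried through all time steps before the next block is touched. No temporary array is needed.

```python
def sweep_reference(a, steps):
    """Untransformed loop nest: one full in-place sweep per time step."""
    a = list(a)
    n = len(a)
    for t in range(steps):
        for i in range(1, n - 1):
            a[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0
    return a


def sweep_time_skewed(a, steps, block):
    """Same computation, skewed by j = i + t and blocked along j.

    Illustrative sketch only (hypothetical helper, not the paper's method).
    In the original nest the dependence distances in (t, i) are
    (0, 1), (1, 0) and (1, -1). After skewing they become
    (0, 1), (1, 1) and (1, 0), so blocking along j is legal.
    """
    a = list(a)
    n = len(a)
    # The skewed index j = i + t covers 1 .. (n - 2) + (steps - 1).
    j_max = (n - 2) + (steps - 1)
    for j0 in range(1, j_max + 1, block):      # one skewed block at a time
        for t in range(steps):                 # all time steps inside the block
            lo = max(j0, t + 1)                # clip to i >= 1
            hi = min(j0 + block - 1, t + n - 2)  # clip to i <= n - 2
            for j in range(lo, hi + 1):
                i = j - t                      # undo the skew to index the array
                a[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0
    return a


if __name__ == "__main__":
    data = [float(k * k % 7) for k in range(20)]
    print(sweep_time_skewed(data, steps=5, block=4) == sweep_reference(data, steps=5))
```

Because each element is updated with the same operands in the same order, the skewed version reproduces the reference result exactly. In a real code, `block` would be chosen so that the data touched by one block across all time steps fits in the targeted cache level.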
