Automatic tiling of iterative stencil loops

Iterative stencil loops are used in scientific programs to implement relaxation methods for numerical simulation and signal processing. Such loops repeatedly modify the same array elements over successive time steps, which gives the compiler an opportunity to improve temporal data locality through loop tiling. This article presents a compiler framework for automatic tiling of iterative stencil loops, with the objective of improving cache performance. The article first presents a technique that allows loop tiling to satisfy data dependences despite the difficulty created by imperfectly nested inner loops. It does so by skewing the inner loops over the time steps and by applying a uniform skew factor to all loops at the same nesting level. Based on a memory-cost analysis, the article shows that the skew factor must be minimized at every loop level in order to minimize cache misses. A polynomial-time graph-theoretical algorithm is presented to determine the minimum skew factor. Furthermore, the memory-cost analysis derives the tile size that minimizes capacity misses. Given the tile size, an efficient and general array-padding scheme is applied to remove conflict misses. Experiments were conducted on 16 test programs, and preliminary results showed an average speedup of 1.58 and a maximum speedup of 5.06 across those programs.
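To make the skew-then-tile idea concrete, the following is a minimal sketch on a deliberately simple case: a single, perfectly nested, in-place 3-point stencil in one dimension. It is not the article's algorithm, which handles imperfectly nested loops and computes the minimum skew factor automatically. The function names, the stencil, and the fixed `skew=1` are assumptions chosen for illustration. In (t, i) iteration space this stencil has dependence distances (0, 1), (1, 0) and (1, -1). The negative component in the last one makes plain tiling of `i` illegal, and substituting `j = i + skew * t` with `skew = 1` turns the distances into (0, 1), (1, 1) and (1, 0), all non-negative, so rectangular tiles over `j` can be executed one after another.

```python
def sweep_reference(a, steps):
    """Untiled in-place 3-point stencil: `steps` full sweeps over the interior."""
    a = list(a)
    n = len(a)
    for t in range(steps):
        for i in range(1, n - 1):
            a[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0
    return a


def sweep_skewed_tiled(a, steps, tile, skew=1):
    """Same computation with the space loop skewed over time, then tiled.

    Assumption: skew >= 1, which is sufficient for this particular stencil.
    The skewed index is j = i + skew * t, and tiles are strips of `tile`
    consecutive j values that are run through all time steps before
    moving on to the next strip.
    """
    a = list(a)
    n = len(a)
    # Largest skewed index reached by the last interior point at the last step.
    j_max = (n - 2) + skew * (steps - 1)
    for jj in range(1, j_max + 1, tile):        # loop over tiles of j
        for t in range(steps):                  # time loop, now inside the tile
            # Intersect the tile [jj, jj + tile - 1] with the skewed
            # interior range [1 + skew * t, (n - 2) + skew * t].
            lo = max(jj, 1 + skew * t)
            hi = min(jj + tile - 1, (n - 2) + skew * t)
            for j in range(lo, hi + 1):
                i = j - skew * t                # recover the original index
                a[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0
    return a


if __name__ == "__main__":
    data = [float((k * k) % 7) for k in range(20)]
    ref = sweep_reference(data, steps=5)
    tiled = sweep_skewed_tiled(data, steps=5, tile=4)
    print("results match:", tiled == ref)
```

Because every update reads exactly the same operand values in both versions, the results should agree exactly and not merely to within rounding. A larger skew factor also preserves the dependences here, but it makes each tile sweep across more array elements over the time steps, which is in line with the article's conclusion that the skew factor should be as small as possible.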