On the Parallel Execution Time of Tiled Loops

Many computationally intensive programs, such as solvers for differential equations, spatial interpolation, and dynamic programming, spend a large portion of their execution time in multiply nested loops that have a regular stencil of data dependences. Tiling is a well-known compiler optimization that improves performance on such loops, particularly for computers with a multilevel hierarchy of parallelism and memory. Most previous work on tiling is limited in at least one of the following ways: it handles only loop nests of depth two, only orthogonal tiling, or only rectangular tiles. In our work, we tile loop nests of arbitrary depth using polyhedral tiles. We derive a prediction formula for the execution time of such tiled loops, which a compiler can use to automatically determine the tiling parameters that minimize the execution time. We also introduce the notion of rise, a measure of the relationship between the shape of the tiles and the shape of the iteration space generated by the loop nest. The rise is a powerful tool for predicting the execution time of a tiled loop: it allows us to reason about how tiling affects the length of the longest path of dependent tiles, which is a measure of the execution time of a tiling. We use a model of the tiled iteration space that allows us to determine the length of the longest path of dependent tiles using linear programming. Using the rise, we derive a simple formula for the length of the longest path of dependent tiles in rectilinear iteration spaces, a subclass of the convex iteration spaces, and show how to choose the optimal tile shape.
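To make the longest-path argument concrete, the following is a minimal sketch (not the paper's general polyhedral model) for the simplest special case: a depth-two stencil loop with dependences (1,0) and (0,1), tiled with square tiles. Under wavefront scheduling, tile (I, J) can start only after tiles (I-1, J) and (I, J-1) finish, so the parallel execution time is governed by the longest chain of dependent tiles. The function names, the unit per-point cost, and the `sync_cost` parameter are illustrative assumptions.

```python
# Illustrative sketch only: rectangular tiling of an n x n iteration
# space for a stencil loop with dependences (1,0) and (0,1).
import math

def longest_dependent_path(n, tile):
    """Number of tiles on the longest dependence chain when an n x n
    iteration space is cut into tile x tile squares.  The chain runs
    from tile (0,0) to tile (t-1,t-1) along one row and one column."""
    t = math.ceil(n / tile)   # tiles per dimension
    return 2 * t - 1

def parallel_time(n, tile, per_point_cost=1.0, sync_cost=0.0):
    """Crude execution-time estimate (assumed cost model): the length
    of the longest dependent-tile chain times the cost of one tile."""
    per_tile = per_point_cost * tile * tile + sync_cost
    return longest_dependent_path(n, tile) * per_tile
```

For a 100 x 100 space with 10 x 10 tiles, the longest chain holds 19 tiles; shrinking tiles shortens each link but lengthens the chain, which is exactly the trade-off the prediction formula in the paper captures in the general, non-orthogonal setting.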
