Towards Optimal Multi-level Tiling for Stencil Computations

Stencil computations form the performance-critical core of many applications. Tiling and parallelization are two important optimizations for speeding them up, and many tiling and parallelization strategies are applicable to a given stencil computation. The best strategy depends not only on how the two techniques are combined, but also on many parameters: tile and loop sizes in each dimension, the computation-communication balance of the code, the processor architecture, message startup costs, and so on. The best choices can only be determined through design-space exploration, which is extremely tedious and error-prone when done by exhaustive experimentation. We characterize the space of multi-level tilings and parallelizations for 2D and 3D Gauss-Seidel stencil computations. Systematically exploring part of this space enabled us to derive a design that is up to a factor of two faster than the standard implementation.
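To make the terminology concrete, the following is a minimal sketch of a single level of rectangular tiling applied to one in-place sweep of a 2D five-point Gauss-Seidel stencil. It is an illustration under stated assumptions, not the design described in the abstract: the function names, the five-point averaging update, and the tile sizes are all hypothetical, and neither the multi-level tiling, the time dimension, nor the parallelization is shown. Rectangular tiles are legal for a single sweep because every dependence points in a non-negative direction in both i and j, so the tiled version produces the same values as the untiled one.

```python
def make_grid(n, m):
    """Build an n-by-m grid with deterministic values (illustrative only)."""
    return [[float((7 * i + 3 * j) % 5) for j in range(m)] for i in range(n)]


def gauss_seidel_sweep(u):
    """One in-place five-point Gauss-Seidel sweep over the interior of u."""
    n, m = len(u), len(u[0])
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            u[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j] + u[i][j - 1] + u[i][j + 1])


def gauss_seidel_sweep_tiled(u, tile_i, tile_j):
    """The same sweep, visiting the interior one rectangular tile at a time.

    tile_i and tile_j are assumed tile sizes (hypothetical parameters).
    Tiles are visited in lexicographic order, which respects the sweep's
    dependences, so the result matches gauss_seidel_sweep.
    """
    n, m = len(u), len(u[0])
    for ii in range(1, n - 1, tile_i):          # loop over tile origins in i
        for jj in range(1, m - 1, tile_j):      # loop over tile origins in j
            for i in range(ii, min(ii + tile_i, n - 1)):      # points in tile
                for j in range(jj, min(jj + tile_j, m - 1)):
                    u[i][j] = 0.25 * (
                        u[i - 1][j] + u[i + 1][j] + u[i][j - 1] + u[i][j + 1]
                    )


if __name__ == "__main__":
    reference = make_grid(9, 11)
    tiled = [row[:] for row in reference]
    gauss_seidel_sweep(reference)
    gauss_seidel_sweep_tiled(tiled, 4, 3)
    print(reference == tiled)
```

In a sketch like this, the tile sizes are the kind of parameters the abstract refers to: they change the order in which points are visited, and therefore the memory behaviour, without changing the computed values.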
