Paper Information - Towards Optimal Multi-level Tiling for Stencil Computations

Stencil computations form the performance-critical core of many applications. Tiling and parallelization are two important optimizations for speeding them up, and many tiling and parallelization strategies are applicable to a given stencil computation. The best strategy depends not only on how the two techniques are combined, but also on many parameters: tile and loop sizes in each dimension, the computation-communication balance of the code, the processor architecture, message startup costs, and so on. The best choices can only be determined through design-space exploration, which is extremely tedious and error-prone when done through exhaustive experimentation. We characterize the space of multi-level tilings and parallelizations for 2D and 3D Gauss-Seidel stencil computations. A systematic exploration of part of this space enabled us to derive a design that is up to a factor of two faster than the standard implementation.
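To make the terms concrete, the following is a minimal sketch of a single in-place 2D Gauss-Seidel sweep and a one-level, rectangularly tiled version of the same sweep. It is an illustration only, not the design from the paper: the function names, the 4-point averaging update, and the tile-size parameters `tile_i` and `tile_j` are assumptions introduced here. The paper's multi-level tilings, the tiling of the time dimension, and parallelization are not shown.

```python
def make_grid(n, m):
    """Build an n x m grid with deterministic, non-uniform values (illustrative)."""
    return [[float((3 * i + 7 * j) % 11) for j in range(m)] for i in range(n)]


def gauss_seidel_sweep(grid):
    """One in-place Gauss-Seidel sweep over the interior points, row by row."""
    n, m = len(grid), len(grid[0])
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            # Uses already-updated values at (i-1, j) and (i, j-1),
            # and not-yet-updated values at (i+1, j) and (i, j+1).
            grid[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                 + grid[i][j - 1] + grid[i][j + 1])


def gauss_seidel_sweep_tiled(grid, tile_i, tile_j):
    """The same sweep, with the (i, j) iteration space cut into rectangular tiles.

    tile_i and tile_j are hypothetical tuning parameters: the tile extent
    along each dimension. Tiles are visited in lexicographic order, which
    preserves the update order each point depends on within one sweep.
    """
    n, m = len(grid), len(grid[0])
    for ii in range(1, n - 1, tile_i):          # loop over tile origins
        for jj in range(1, m - 1, tile_j):
            for i in range(ii, min(ii + tile_i, n - 1)):      # loop inside a tile
                for j in range(jj, min(jj + tile_j, m - 1)):
                    grid[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                         + grid[i][j - 1] + grid[i][j + 1])


if __name__ == "__main__":
    reference = make_grid(9, 10)
    tiled = make_grid(9, 10)
    gauss_seidel_sweep(reference)
    gauss_seidel_sweep_tiled(tiled, tile_i=3, tile_j=4)
    print(reference == tiled)
```

Within a single sweep, every point depends only on neighbors in non-decreasing `i` and `j` order, so rectangular tiles are enough to keep the result identical to the untiled loop. Once several sweeps (the time loop) are tiled together, rectangular tiles are in general no longer legal and a skewed tiling is needed. Even in this one-level sketch, the tile sizes are free parameters, which is the kind of choice the design-space exploration described above has to make.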