Model-Driven Tile Size Selection for DOACROSS Loops on GPUs

DOALL loops are tiled to exploit DOALL parallelism and data locality on GPUs. DOACROSS loops, in contrast, have loop-carried dependences and must first be skewed to make tiling legal and to expose wavefront parallelism both across tiles and within each tile. Tile size selection, which is performance-critical, is therefore more complex for DOACROSS loops than for DOALL loops on GPUs. This paper presents a model-driven approach to automating tile size selection for DOACROSS loops. Validation using 1D, 2D and 3D successive over-relaxation (SOR) solvers shows that our framework finds tile sizes for these representative DOACROSS loops that achieve performance close to the best observed across the range of problem sizes tested.
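To illustrate why skewing is needed before tiling, the following minimal sketch applies it to an in-place 1D SOR sweep. It is not the paper's framework or its GPU code. The function names, the relaxation factor `omega = 1.5`, and the tile-size parameters `tile_t` and `tile_j` are assumptions made for illustration. In the original (t, i) iteration space the dependence distances are (0, 1), (1, -1) and (1, 0). Substituting j = i + t turns them into (0, 1), (1, 0) and (1, 1), which are all non-negative, so rectangular tiles become legal and tiles with equal tt + jj lie on the same wavefront.

```python
def sor_1d(a, steps, omega=1.5):
    """Reference in-place 1D SOR sweep in the original (t, i) order."""
    a = list(a)
    n = len(a)
    for t in range(steps):
        for i in range(1, n - 1):
            a[i] = (1 - omega) * a[i] + omega * 0.5 * (a[i - 1] + a[i + 1])
    return a


def sor_1d_skewed_tiled(a, steps, tile_t, tile_j, omega=1.5):
    """Same computation after skewing (j = i + t) and rectangular tiling.

    Tiles are visited wavefront by wavefront (tt + jj == w). Tiles on one
    wavefront do not depend on each other, so a GPU could run them
    concurrently; this CPU sketch simply runs them one after another.
    """
    a = list(a)
    n = len(a)
    j_count = (n - 2) + steps          # skewed index j ranges over [0, j_count)
    num_tt = -(-steps // tile_t)       # ceiling division
    num_jj = -(-j_count // tile_j)
    for w in range(num_tt + num_jj - 1):                # tile wavefronts
        tt_lo = max(0, w - num_jj + 1)
        tt_hi = min(num_tt, w + 1)
        for tt in range(tt_lo, tt_hi):                  # independent tiles
            jj = w - tt
            for t in range(tt * tile_t, min((tt + 1) * tile_t, steps)):
                for j in range(jj * tile_j, min((jj + 1) * tile_j, j_count)):
                    i = j - t                           # undo the skew
                    if 1 <= i <= n - 2:                 # skip padding points
                        a[i] = ((1 - omega) * a[i]
                                + omega * 0.5 * (a[i - 1] + a[i + 1]))
    return a
```

Because every point reads the same operand values in both orderings, the two functions should return identical arrays for any positive tile sizes. The sketch only demonstrates legality. It says nothing about which tile sizes perform best, which is the question the paper's model addresses.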