Modeling the Performance of 2.5D Blocking of 3D Stencil Code on GPUs

GPUs can significantly improve the performance of stencil computations. In particular, 3D stencils are known to benefit from 2.5D blocking, an optimization that reduces the global memory bandwidth a stencil requires and is critical to attaining high performance on GPUs. Using four GPU implementations of a 3D stencil, this paper studies the performance implications of combining 2.5D blocking with different memory placement strategies: global memory only, shared memory only, register files only, and a hybrid strategy that uses all of these memory levels. Based on static analysis of the stencil's data access patterns, we additionally develop heuristics that reduce the time needed to tune the thread configurations of these implementations for the highest performance.
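
To make the idea of 2.5D blocking concrete, the following is a minimal CPU-side sketch in Python, not the paper's GPU code. It assumes a 7-point stencil with illustrative coefficients, tiles the x-y plane, and streams each tile along z while keeping the planes below, at, and above the current one in a small sliding window. All names here (`make_grid`, `stencil_naive`, `stencil_25d`, `tx`, `ty`) are hypothetical, and the `loads` counter is only a rough proxy for global memory reads that ignores caches and coalescing.

```python
# Illustrative emulation of 2.5D blocking for a 7-point 3D stencil.
# Assumption: coefficients c0/c1 and tile sizes tx/ty are made up for the example.

def make_grid(nx, ny, nz):
    # Deterministic test data, indexed as grid[z][y][x].
    return [[[float((x * 7 + y * 3 + z * 5) % 11) for x in range(nx)]
             for y in range(ny)] for z in range(nz)]

def stencil_naive(a, c0=0.5, c1=0.25):
    # Every interior point reads all 7 inputs directly from the "global" array.
    nz, ny, nx = len(a), len(a[0]), len(a[0][0])
    out = [[[0.0] * nx for _ in range(ny)] for _ in range(nz)]
    loads = 0
    for z in range(1, nz - 1):
        for y in range(1, ny - 1):
            for x in range(1, nx - 1):
                out[z][y][x] = c0 * a[z][y][x] + c1 * (
                    a[z][y][x - 1] + a[z][y][x + 1] +
                    a[z][y - 1][x] + a[z][y + 1][x] +
                    a[z - 1][y][x] + a[z + 1][y][x])
                loads += 7
    return out, loads

def stencil_25d(a, tx=4, ty=4, c0=0.5, c1=0.25):
    # Tile the x-y plane (2D blocking) and stream along z (the extra "0.5D").
    nz, ny, nx = len(a), len(a[0]), len(a[0][0])
    out = [[[0.0] * nx for _ in range(ny)] for _ in range(nz)]
    loads = 0
    for y0 in range(1, ny - 1, ty):
        for x0 in range(1, nx - 1, tx):
            y1, x1 = min(y0 + ty, ny - 1), min(x0 + tx, nx - 1)
            pts = [(y, x) for y in range(y0, y1) for x in range(x0, x1)]
            # Sliding window along z: one value per (y, x) column, per plane.
            below = {p: a[0][p[0]][p[1]] for p in pts}
            curr = {p: a[1][p[0]][p[1]] for p in pts}
            loads += 2 * len(pts)
            for z in range(1, nz - 1):
                above = {p: a[z + 1][p[0]][p[1]] for p in pts}
                loads += len(pts)
                # Current-plane tile: interior reused from curr, halo read once.
                tile = dict(curr)
                for y in range(y0, y1):
                    tile[(y, x0 - 1)] = a[z][y][x0 - 1]
                    tile[(y, x1)] = a[z][y][x1]
                    loads += 2
                for x in range(x0, x1):
                    tile[(y0 - 1, x)] = a[z][y0 - 1][x]
                    tile[(y1, x)] = a[z][y1][x]
                    loads += 2
                for (y, x) in pts:
                    out[z][y][x] = c0 * curr[(y, x)] + c1 * (
                        tile[(y, x - 1)] + tile[(y, x + 1)] +
                        tile[(y - 1, x)] + tile[(y + 1, x)] +
                        below[(y, x)] + above[(y, x)])
                # Shift the window up one plane; no re-read of old planes.
                below, curr = curr, above
    return out, loads

if __name__ == "__main__":
    grid = make_grid(10, 10, 10)
    ref, naive_loads = stencil_naive(grid)
    res, blocked_loads = stencil_25d(grid)
    print(res == ref, naive_loads, blocked_loads)
```

In a CUDA kernel, the role of `tile` would typically be played by shared memory and the `below`/`curr`/`above` window by per-thread registers; which of these levels to use is the placement choice the abstract describes. The emulation only shows why streaming along z reduces repeated reads and makes no claim about actual GPU timings.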
