Optimizing Stencil Computations for NVIDIA Kepler GPUs

We present a series of optimization techniques for stencil computations on NVIDIA Kepler GPUs. Stencil computations with regular grids had been ported to the older generations of NVIDIA GPUs with signicant performance improvements thanks to the higher memory bandwidth than conventional CPU-only systems. However, because of the architectural changes introduced with the latest generation of the GPU architecture, Kepler, we show that existing implementation strategies used for such older GPUs are not as eective on Kepler as before. To fully exploit the potential performance of the latest generation of the GPU architecture, our implementation method uses shared memory for better data locality combined with warp specialization for higher instruction throughput. Our method achieves approximately 80% of the estimated peak performance by the rooine model, and even higher performance with temporal blocking.

[1]  David G. Wonnacott,et al.  Using time skewing to eliminate idle time due to memory bandwidth and network limitations , 2000, Proceedings 14th International Parallel and Distributed Processing Symposium. IPDPS 2000.

[2]  Satoshi Matsuoka,et al.  Peta-scale phase-field simulation for dendritic solidification on the TSUBAME 2.0 supercomputer , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[3]  Bradley C. Kuszmaul,et al.  The pochoir stencil compiler , 2011, SPAA '11.

[4]  P. Sadayappan,et al.  High-performance code generation for stencil computations on GPU architectures , 2012, ICS '12.

[5]  Satoshi Matsuoka,et al.  Physis: An implicitly parallel programming model for stencil computations on large-scale GPU-accelerated supercomputers , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[6]  Pradeep Dubey,et al.  3.5-D Blocking Optimization for Stencil Computations on Modern CPUs and GPUs , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.

[7]  Massimiliano Fatica,et al.  Implementing the Himeno benchmark with CUDA on GPU clusters , 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS).

[8]  Paulius Micikevicius,et al.  3D finite difference computation on GPUs using CUDA , 2009, GPGPU-2.

[9]  Samuel Williams,et al.  Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures , 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.

[10]  Samuel Williams,et al.  Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures , 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.

[11]  Gerhard W. Zumbusch Vectorized Higher Order Finite Difference Kernels , 2012, PARA.

[12]  Brucek Khailany,et al.  CudaDMA: Optimizing GPU memory bandwidth via warp specialization , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[13]  Samuel Williams,et al.  Roofline: an insightful visual performance model for multicore architectures , 2009, CACM.

[14]  Albert Cohen,et al.  Split tiling for GPUs: automatic parallelization using trapezoidal tiles , 2013, GPGPU@ASPLOS.