Optimized three-dimensional stencil computation on Fermi and Kepler GPUs

Stencil-based algorithms are used intensively in scientific computing. Implementations of stencil computations on Graphics Processing Units (GPUs) speed up execution significantly compared to conventional CPU-only systems. In this paper we focus on double precision stencil computations, which are required to meet the high accuracy demands inherent in scientific computing. Starting from two baseline implementations (using two-dimensional and three-dimensional thread block structures, respectively), we apply different optimization techniques, leading to seven kernel versions. Both Fermi and Kepler GPUs are used to evaluate the impact of the different optimization techniques on the two architectures. Overall, the GTX680 GPU performs best with a kernel that uses a 2D thread block structure and optimized register and shared memory usage. We show that, whereas shared memory is not essential on Fermi GPUs, it is a highly effective optimization technique on Kepler GPUs (mainly due to the different L1 cache usage). Furthermore, we evaluate the performance of Kepler GPU cards designed for desktop PCs and for notebook PCs. The results indicate that the ratio of execution times is roughly equal to the inverse of the ratio of power consumption.
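To illustrate the best-performing strategy described above (a 2D thread block that tiles an xy-plane in shared memory, marches along z, and keeps the z-neighbours in registers), the following is a minimal CUDA sketch. It is not the authors' exact kernel: the 7-point Laplacian form, the tile sizes TX/TY, the coefficient names c0/c1, the kernel name stencil7_2dblock, and the assumption that nx and ny are multiples of the tile dimensions are all illustrative choices.

```cuda
#include <cuda_runtime.h>

#define TX 32
#define TY 8

// Sketch of a double precision 7-point stencil with a 2D thread block:
// each block loads one xy-plane (plus halo) into shared memory and keeps
// the z-neighbours in registers while sweeping the z-direction.
// Assumes nx % TX == 0 and ny % TY == 0 for brevity.
__global__ void stencil7_2dblock(const double* __restrict__ in,
                                 double* __restrict__ out,
                                 int nx, int ny, int nz,
                                 double c0, double c1)
{
    __shared__ double plane[TY + 2][TX + 2];   // xy-tile with 1-point halo

    int ix = blockIdx.x * TX + threadIdx.x;    // global x index
    int iy = blockIdx.y * TY + threadIdx.y;    // global y index
    int tx = threadIdx.x + 1;                  // tile coordinates (skip halo)
    int ty = threadIdx.y + 1;

    size_t stride = (size_t)nx * ny;           // distance between z-planes
    size_t idx = (size_t)iy * nx + ix;         // index in plane z = 0

    // Register queue for the z-direction: below, current, above.
    double below = in[idx];                    // z = 0
    double curr  = in[idx + stride];           // z = 1
    double above;

    for (int iz = 1; iz < nz - 1; ++iz) {
        idx += stride;                         // advance to plane z = iz
        above = in[idx + stride];

        // Stage the current plane and its halo in shared memory.
        plane[ty][tx] = curr;
        if (threadIdx.x == 0      && ix > 0     ) plane[ty][0]      = in[idx - 1];
        if (threadIdx.x == TX - 1 && ix < nx - 1) plane[ty][TX + 1] = in[idx + 1];
        if (threadIdx.y == 0      && iy > 0     ) plane[0][tx]      = in[idx - nx];
        if (threadIdx.y == TY - 1 && iy < ny - 1) plane[TY + 1][tx] = in[idx + nx];
        __syncthreads();

        if (ix > 0 && ix < nx - 1 && iy > 0 && iy < ny - 1) {
            out[idx] = c0 * curr
                     + c1 * (plane[ty][tx - 1] + plane[ty][tx + 1]
                           + plane[ty - 1][tx] + plane[ty + 1][tx]
                           + below + above);
        }
        __syncthreads();                       // protect plane[] before reuse

        below = curr;                          // shift the register queue
        curr  = above;
    }
}
```

A launch such as `stencil7_2dblock<<<dim3(nx/TX, ny/TY), dim3(TX, TY)>>>(...)` would sweep the full volume; the xy-halo is shared through shared memory while the z-direction reuses registers, which is the combination the abstract reports as most effective on the GTX680.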
