Performance Limits Study of Stencil Codes on Modern GPGPUs

We study the performance limits of several algorithmic approaches to implementing a sample problem: the solution of the wave equation with a cross-stencil scheme. The aim is to find the upper limit of the achievable performance efficiency for stencil computing. To estimate the limits, we use a quantitative Roofline model to analyze the performance bottlenecks thoroughly, and we extend the model to account for the latency of different levels of GPU memory. These estimates motivate the use of spatial and temporal blocking algorithms. Accordingly, we study stepwise, domain decomposition, and domain decomposition with halo algorithms, in that order. Knowing the limit motivates further optimization of the implementation, which led to an analysis of block synchronization methods in CUDA that is also provided in the text. After all optimizations, we achieved 90% of the peak performance, which amounts to more than 1 trillion cell updates per second on a single consumer-level GPU.
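To make the Roofline reasoning concrete, the following is a minimal sketch of how such an upper bound on the cell update rate might be estimated. It covers only the basic Roofline model (the minimum of a compute bound and a memory bound), not the latency-aware extension developed in the paper. All function names and every hardware and per-cell figure are hypothetical placeholders chosen for illustration; they are not taken from the paper or from any measured device.

```python
def roofline_cell_rate(peak_flops, mem_bandwidth, flops_per_cell, bytes_per_cell):
    """Basic Roofline bound on cell updates per second.

    The rate is limited either by arithmetic throughput or by memory
    traffic, whichever bound is lower.
    """
    compute_bound = peak_flops / flops_per_cell
    memory_bound = mem_bandwidth / bytes_per_cell
    return min(compute_bound, memory_bound)


def blocked_bytes_per_cell(bytes_per_cell, steps_per_pass):
    """Memory traffic per cell update when one pass over the data
    advances each cell by `steps_per_pass` time steps (idealized
    temporal blocking, ignoring halo overhead)."""
    return bytes_per_cell / steps_per_pass


# Hypothetical placeholder values, NOT measurements and NOT from the paper.
PEAK_FLOPS = 10e12      # assumed peak arithmetic throughput, FLOP/s
MEM_BANDWIDTH = 500e9   # assumed device memory bandwidth, bytes/s
FLOPS_PER_CELL = 10     # assumed cost of one cross-stencil cell update
BYTES_PER_CELL = 12     # assumed: read two time layers, write one, 4 bytes each

# Stepwise algorithm: all data streams through device memory every step.
stepwise_rate = roofline_cell_rate(
    PEAK_FLOPS, MEM_BANDWIDTH, FLOPS_PER_CELL, BYTES_PER_CELL
)

# Idealized temporal blocking over 32 steps per pass.
blocked_rate = roofline_cell_rate(
    PEAK_FLOPS,
    MEM_BANDWIDTH,
    FLOPS_PER_CELL,
    blocked_bytes_per_cell(BYTES_PER_CELL, 32),
)

print(f"stepwise bound: {stepwise_rate:.3e} cell updates/s")
print(f"blocked bound:  {blocked_rate:.3e} cell updates/s")
```

Under these assumed numbers the stepwise variant is memory-bound, while the temporally blocked variant reaches the compute bound. This is the kind of gap that gives an incentive to use blocking; the actual figures depend on the device and the scheme.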
