A review of CUDA optimization techniques and tools for structured grid computing

Recent advances in GPUs have opened new opportunities for harnessing their computing power for general-purpose computation. CUDA, an extension to the C programming language, was developed for programming NVIDIA GPUs. However, programming GPUs efficiently with CUDA remains tedious and error-prone even for expert programmers: the programmer must optimize resource occupancy and manage data transfers between the host and the GPU, as well as across the GPU memory hierarchy. This paper presents the basic architectural optimizations and explores their implementations in research and industry compilers. The review focuses on accelerating computational science applications, in particular the class of structured grid computations (SGCs). It also discusses the mismatch between current compiler techniques and the requirements for implementing efficient iterative linear solvers, and examines the approaches computational scientists use to program SGCs. Finally, a set of tools providing the main optimization functionalities for an integrated library is proposed to ease the process of defining complex SGC data structures and optimizing solver code through an intelligent high-level interface and domain-specific annotations.
