Accelerating Scientific Applications using High Performance Dense and Sparse Linear Algebra Kernels on GPUs

Ahmad Mohammad Abdelfattah

High performance computing (HPC) platforms are evolving toward more heterogeneous configurations to support the workloads of various applications. The current hardware landscape is composed of traditional multicore CPUs equipped with hardware accelerators that can handle high levels of parallelism. Graphics Processing Units (GPUs) are popular high performance accelerators in modern supercomputers. GPU programming follows a different model than CPU programming, which means that many numerical kernels have to be redesigned and optimized specifically for this architecture. GPUs typically outperform multicore CPUs on compute-intensive, massively parallel applications with regular processing patterns. However, most scientific applications rely on crucial memory-bound kernels and may suffer bottlenecks due to memory bus latency. Such kernels can still take advantage of GPU compute capabilities, provided that an efficient, architecture-aware design is achieved.

This dissertation presents a uniform design strategy for optimizing critical memory-bound kernels on GPUs. Based on hierarchical register blocking, double buffering, and latency hiding techniques, this strategy leverages the performance of a wide range of standard numerical kernels found in dense and sparse linear algebra libraries. The work presented here focuses on matrix-vector multiplication (MVM) kernels as representative and among the most important memory-bound operations in this context. Each kernel inherits the benefits of the proposed strategies. By exposing a proper set of tuning parameters, the strategy is flexible enough to suit different types of matrices, ranging from large dense matrices to sparse matrices with dense block structures, while high performance is maintained. Furthermore, the tuning parameters are used to maintain the relative performance across different GPU architectures. Multi-GPU acceleration is proposed to scale the performance across several devices. Performance experiments show improvements ranging from 10% up to more than a fourfold speedup over competitive GPU MVM approaches. Performance impacts on high-level numerical libraries and on a computational astronomy application are highlighted, since such memory-bound kernels are often located at the innermost levels of the software chain. The excellent performance obtained in this work has led to the adoption of the code in NVIDIA's widely distributed cuBLAS library.
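To make the design strategy concrete, the sketch below shows a simplified single-precision GEMV (y = alpha*A*x + beta*y, column-major storage) in CUDA that combines per-thread register blocking with double buffering of the matrix operand: the next block of columns of A is prefetched into registers while the current block is being multiplied, hiding global-memory latency. The kernel name sgemv_reg_blocked, the THREADS and COL_BLOCK sizes, and the overall structure are illustrative assumptions for this sketch only; they are not the dissertation's actual KBLAS or cuBLAS kernels, which add hierarchical blocking, transposed and symmetric variants, and autotuned parameters.

```cuda
// Minimal sketch (not the dissertation's actual kernel): single-precision
// GEMV, y = alpha*A*x + beta*y, with A stored column-major. Each thread
// owns one row of A, accumulates in registers (register blocking), and
// prefetches the next COL_BLOCK columns while multiplying the current
// ones (double buffering). THREADS, COL_BLOCK, and the kernel name are
// hypothetical choices made for clarity.

#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define THREADS   128  // rows handled per thread block
#define COL_BLOCK 4    // register-block width (columns per iteration)

__global__ void sgemv_reg_blocked(int m, int n, float alpha,
                                  const float* __restrict__ A, int lda,
                                  const float* __restrict__ x,
                                  float beta, float* __restrict__ y)
{
    int row = blockIdx.x * THREADS + threadIdx.x;
    if (row >= m) return;

    float acc = 0.0f;
    float cur[COL_BLOCK], nxt[COL_BLOCK];   // double buffers in registers
    int   nb = (n / COL_BLOCK) * COL_BLOCK; // columns covered by full blocks

    if (nb > 0) {
        #pragma unroll
        for (int k = 0; k < COL_BLOCK; k++)      // prefetch the first block
            cur[k] = A[row + k * lda];

        for (int j = 0; j < nb; j += COL_BLOCK) {
            bool more = (j + COL_BLOCK < nb);
            if (more) {                          // prefetch the next block early
                #pragma unroll
                for (int k = 0; k < COL_BLOCK; k++)
                    nxt[k] = A[row + (j + COL_BLOCK + k) * lda];
            }
            #pragma unroll
            for (int k = 0; k < COL_BLOCK; k++)  // compute on the current block
                acc += cur[k] * x[j + k];
            if (more) {                          // swap buffers
                #pragma unroll
                for (int k = 0; k < COL_BLOCK; k++)
                    cur[k] = nxt[k];
            }
        }
    }
    for (int j = nb; j < n; j++)                 // leftover columns
        acc += A[row + j * lda] * x[j];

    y[row] = alpha * acc + beta * y[row];
}

int main()
{
    const int m = 1024, n = 1024;
    std::vector<float> hA(size_t(m) * n, 1.0f), hx(n, 1.0f), hy(m, 0.0f);

    float *dA, *dx, *dy;
    cudaMalloc(&dA, sizeof(float) * m * n);
    cudaMalloc(&dx, sizeof(float) * n);
    cudaMalloc(&dy, sizeof(float) * m);
    cudaMemcpy(dA, hA.data(), sizeof(float) * m * n, cudaMemcpyHostToDevice);
    cudaMemcpy(dx, hx.data(), sizeof(float) * n, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy.data(), sizeof(float) * m, cudaMemcpyHostToDevice);

    sgemv_reg_blocked<<<(m + THREADS - 1) / THREADS, THREADS>>>
        (m, n, 1.0f, dA, m, dx, 0.0f, dy);
    cudaMemcpy(hy.data(), dy, sizeof(float) * m, cudaMemcpyDeviceToHost);

    // With A and x filled with ones, every entry of y should equal n.
    printf("y[0] = %.1f (expected %d)\n", hy[0], n);
    cudaFree(dA); cudaFree(dx); cudaFree(dy);
    return 0;
}
```

Parameters such as COL_BLOCK and THREADS are exactly the kind of tuning knobs the abstract refers to: re-selecting them per GPU generation is how the relative performance is maintained across architectures.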