Efficient Sparse Matrix-Vector Multiplication on GPUs Using the CSR Storage Format

The performance of sparse matrix-vector multiplication (SpMV) is important to computational scientists. Compressed sparse row (CSR) is the most frequently used format for storing sparse matrices. However, CSR-based SpMV on graphics processing units (GPUs) performs poorly because of irregular memory access patterns, load imbalance, and reduced parallelism. This has led researchers to propose new storage formats. Unfortunately, dynamically transforming CSR into these formats incurs significant runtime and storage overheads. We propose a novel algorithm, CSR-Adaptive, which keeps the CSR format intact and maps well to GPUs. Our implementation addresses the aforementioned challenges by (i) efficiently accessing DRAM by streaming data into the local scratchpad memory and (ii) dynamically assigning different numbers of rows to each parallel GPU compute unit. CSR-Adaptive achieves an average speedup of 14.7× over existing CSR-based algorithms and 2.3× over clSpMV cocktail, which uses an assortment of matrix formats.
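To make the two ideas in the abstract concrete, the following is a minimal CPU-side sketch in Python, not the authors' GPU implementation. It shows a reference CSR SpMV, a greedy grouping of consecutive rows into blocks with a bounded number of nonzeros, and a blocked SpMV that stages each block's products in a small buffer before reducing them per row. The function names (`csr_spmv`, `make_row_blocks`, `spmv_by_blocks`), the `max_nnz` parameter, and the greedy grouping rule are all assumptions made for illustration; the abstract does not specify how rows are assigned to compute units.

```python
from typing import List, Sequence


def csr_spmv(row_ptr: Sequence[int], col_idx: Sequence[int],
             vals: Sequence[float], x: Sequence[float]) -> List[float]:
    """Reference y = A @ x for a matrix stored in CSR form."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for row in range(n_rows):
        acc = 0.0
        # Nonzeros of this row live in the half-open range
        # [row_ptr[row], row_ptr[row + 1]).
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += vals[k] * x[col_idx[k]]
        y[row] = acc
    return y


def make_row_blocks(row_ptr: Sequence[int], max_nnz: int) -> List[int]:
    """Greedily group consecutive rows into blocks of at most max_nnz nonzeros.

    Hypothetical helper: max_nnz stands in for the scratchpad capacity.
    A row longer than max_nnz ends up alone in its own block.
    Returns block boundaries, so block b covers rows blocks[b]..blocks[b+1]-1.
    """
    n_rows = len(row_ptr) - 1
    blocks = [0]
    nnz_in_block = 0
    for row in range(n_rows):
        row_nnz = row_ptr[row + 1] - row_ptr[row]
        # Start a new block only if the current one is non-empty and would overflow.
        if row > blocks[-1] and nnz_in_block + row_nnz > max_nnz:
            blocks.append(row)
            nnz_in_block = 0
        nnz_in_block += row_nnz
    blocks.append(n_rows)
    return blocks


def spmv_by_blocks(row_ptr: Sequence[int], col_idx: Sequence[int],
                   vals: Sequence[float], x: Sequence[float],
                   blocks: Sequence[int]) -> List[float]:
    """SpMV processed block by block; each block mimics one compute unit."""
    y = [0.0] * (len(row_ptr) - 1)
    for b in range(len(blocks) - 1):
        first, last = blocks[b], blocks[b + 1]
        base = row_ptr[first]
        # Read the block's nonzeros contiguously and stage the products
        # (a stand-in for streaming into local scratchpad memory).
        staged = [vals[k] * x[col_idx[k]] for k in range(base, row_ptr[last])]
        # Reduce the staged products row by row.
        for row in range(first, last):
            lo, hi = row_ptr[row] - base, row_ptr[row + 1] - base
            y[row] = sum(staged[lo:hi], 0.0)
    return y


# Small 4x4 example with uneven row lengths (2, 1, 4, and 1 nonzeros).
row_ptr = [0, 2, 3, 7, 8]
col_idx = [0, 2, 1, 0, 1, 2, 3, 3]
vals = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x = [1.0, 1.0, 1.0, 1.0]

blocks = make_row_blocks(row_ptr, max_nnz=3)
y_ref = csr_spmv(row_ptr, col_idx, vals, x)
y_blk = spmv_by_blocks(row_ptr, col_idx, vals, x, blocks)
```

In this sketch the two short rows share one block while the long row gets a block of its own, so the number of rows per block varies with the sparsity pattern. The CSR arrays themselves are never rewritten; only a small list of block boundaries is computed alongside them.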
