Efficient Sparse-Dense Matrix-Matrix Multiplication on GPUs Using the Customized Sparse Storage Format

Multiplication of a sparse matrix by a dense matrix (SpDM) is widely used in many areas such as scientific computing and machine learning. However, existing work overlooks the performance optimization of SpDM on modern many-core architectures such as GPUs. Sparse storage formats keep memory consumption low, but the irregular data access they induce makes it difficult to optimize SpDM performance on modern GPUs, resulting in low resource utilization and poor performance. In this paper, we use the roofline performance model of GPUs to guide the design of an efficient SpDM algorithm called GCOOSpDM, which exploits coalesced global memory access, fast shared-memory reuse, and more operations per byte of global memory traffic. We evaluate GCOOSpDM on three Nvidia GPUs (GTX 980, GTX Titan X Pascal, and Tesla P100) using a large number of matrices, including a public dataset and randomly generated matrices. Experimental results show that GCOOSpDM achieves a 1.5-8x speedup over Nvidia's cuSPARSE library on many matrices.
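
For context, the roofline model the abstract refers to bounds a kernel's attainable throughput by the lower of the GPU's peak compute rate and the product of arithmetic intensity and peak memory bandwidth. The standard form is given below; the symbols are generic labels rather than the paper's own notation:

P_{\text{attainable}} = \min\bigl(P_{\text{peak}},\; I \cdot B_{\text{mem}}\bigr), \qquad I = \frac{\text{floating-point operations}}{\text{bytes of global memory traffic}}

Raising the number of operations per byte of global memory traffic, as GCOOSpDM aims to do, increases I and moves the kernel toward the compute-bound region of the roofline.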

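The abstract names the kernel-level techniques (coalesced global memory access and shared-memory reuse) but does not describe the GCOO format itself, so the sketch below is only a hypothetical illustration of those techniques, not the paper's GCOOSpDM kernel. It computes C = A * B with the sparse matrix A stored in CSR as a stand-in format, stages each sparse row's nonzeros through shared memory once, and lets the threads of a warp sweep consecutive dense columns so that reads of B and writes of C are coalesced.

#include <cuda_runtime.h>

#define TILE_NNZ 128  // nonzeros of one sparse row staged per iteration

// Hypothetical SpDM sketch (CSR stands in for the paper's GCOO format):
// C (MxN) = A (MxK, sparse) * B (KxN, dense, row-major).
__global__ void spdm_row_kernel(int M, int N,
                                const int *rowPtr,    // CSR row pointers, length M+1
                                const int *colIdx,    // CSR column indices
                                const float *vals,    // CSR nonzero values
                                const float *B,       // dense K x N, row-major
                                float *C)             // dense M x N, row-major
{
    __shared__ int   sCol[TILE_NNZ];
    __shared__ float sVal[TILE_NNZ];

    int row   = blockIdx.x;                              // one block per sparse row
    int col   = blockIdx.y * blockDim.x + threadIdx.x;   // one thread per output column
    int start = rowPtr[row];
    int end   = rowPtr[row + 1];
    float acc = 0.0f;

    for (int base = start; base < end; base += TILE_NNZ) {
        int chunk = min(TILE_NNZ, end - base);
        // Coalesced load of the row's nonzeros into shared memory,
        // reused by every thread (output column) of the block.
        for (int i = threadIdx.x; i < chunk; i += blockDim.x) {
            sCol[i] = colIdx[base + i];
            sVal[i] = vals[base + i];
        }
        __syncthreads();

        if (col < N) {
            for (int i = 0; i < chunk; ++i) {
                // Consecutive threads read consecutive columns of B: coalesced.
                acc += sVal[i] * B[sCol[i] * N + col];
            }
        }
        __syncthreads();  // protect the shared tiles before loading the next chunk
    }

    if (col < N)
        C[row * N + col] = acc;  // coalesced row-major store of one row of C
}

A launch such as spdm_row_kernel<<<dim3(M, (N + 255) / 256), 256>>>(M, N, rowPtr, colIdx, vals, B, C) covers every row and column. The actual GCOOSpDM kernel, its grouped-COO layout, and its tuning parameters differ and are described in the paper itself.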