Design Principles for Sparse Matrix Multiplication on the GPU

We implement two novel algorithms for sparse-matrix dense-matrix multiplication (SpMM) on the GPU. Both algorithms take the sparse input in the popular compressed-sparse-row (CSR) format and therefore require no expensive format conversion. Whereas previous SpMM work concentrates on thread-level parallelism, we additionally focus on load-balancing and on hiding latency with instruction-level parallelism. We show, both theoretically and experimentally, that the proposed SpMM is a better fit for the GPU than previous approaches. We also identify a memory access pattern that allows efficient access into both the input and output matrices and that is crucial to good SpMM performance. By combining two ingredients, (i) merge-based load-balancing and (ii) row-major coalesced memory access, we demonstrate a 3.6x peak speedup and a 23.5% geomean speedup over state-of-the-art SpMM implementations on real-world datasets.
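As a point of reference for the terms used above, the following is a minimal sequential sketch of SpMM with a CSR sparse operand and row-major dense operands. It is not the GPU algorithm described in this work; the function name `csr_spmm` and its parameter names are assumptions chosen for illustration. The innermost loop walks consecutive elements of one row of B and one row of C, which is the contiguous access that a row-major layout would let adjacent GPU threads coalesce.

```python
def csr_spmm(row_ptr, col_idx, vals, B, n_cols):
    """Compute C = A @ B for a CSR matrix A and a dense matrix B.

    A is given by (row_ptr, col_idx, vals). B and the result C are flat
    lists in row-major order, each with n_cols columns. All names here
    are illustrative assumptions, not taken from the paper.
    """
    n_rows = len(row_ptr) - 1
    C = [0.0] * (n_rows * n_cols)
    for i in range(n_rows):
        # Nonzeros of row i occupy positions row_ptr[i] .. row_ptr[i+1]-1.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            a = vals[k]
            b_row = col_idx[k] * n_cols  # start of the matching row of B
            # Contiguous sweep over one row of B and one row of C.
            for j in range(n_cols):
                C[i * n_cols + j] += a * B[b_row + j]
    return C


# A = [[1, 0, 2],
#      [0, 0, 0],
#      [0, 3, 0]]   (row 1 is empty, so row lengths are uneven)
row_ptr = [0, 2, 2, 3]
col_idx = [0, 2, 1]
vals = [1.0, 2.0, 3.0]
# B = [[1, 2], [3, 4], [5, 6]] stored row-major.
B = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
C = csr_spmm(row_ptr, col_idx, vals, B, 2)
```

In this sketch the work per row is proportional to the number of nonzeros in that row, so assigning one row per thread would leave threads unevenly loaded; that imbalance is the kind of problem load-balancing is meant to address.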
