Paper information - Design Principles for Sparse Matrix Multiplication on the GPU

Design Principles for Sparse Matrix Multiplication on the GPU

We implement two novel algorithms for sparse-matrix dense-matrix multiplication (SpMM) on the GPU. Our algorithms expect the sparse input in the popular compressed-sparse-row (CSR) format and thus do not require expensive format conversion. While previous SpMM work concentrates on thread-level parallelism, we additionally focus on latency hiding with instruction-level parallelism and load-balancing. We show, both theoretically and experimentally, that the proposed SpMM is a better fit for the GPU than previous approaches. We identify a key memory access pattern that allows efficient access into both input and output matrices and is crucial to getting excellent performance on SpMM. By combining these two ingredients, (i) merge-based load-balancing and (ii) row-major coalesced memory access, we demonstrate a 3.6x peak speedup and a 23.5% geomean speedup over state-of-the-art SpMM implementations on real-world datasets.
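To make the operation in the abstract concrete, the following is a minimal serial sketch of SpMM with a CSR sparse input and row-major dense matrices. It is not the paper's GPU kernels and does not include merge-based load-balancing; the function name `csr_spmm` and its parameter names are assumptions chosen for illustration. What it shows is the access pattern the abstract points to: each nonzero of the sparse matrix touches one contiguous row of the dense input and one contiguous row of the output.

```python
def csr_spmm(row_ptr, col_idx, vals, B, n_cols):
    """Serial reference for C = A @ B (illustrative sketch, not the paper's kernel).

    A is m x k in CSR form (row_ptr, col_idx, vals).
    B is k x n_cols and C is m x n_cols, both stored row-major as flat lists.
    """
    m = len(row_ptr) - 1
    C = [0.0] * (m * n_cols)
    for i in range(m):
        c_row = i * n_cols  # start of row i of C in the flat array
        # Nonzeros of row i of A occupy positions row_ptr[i] .. row_ptr[i+1]-1.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            a = vals[k]
            b_row = col_idx[k] * n_cols  # start of the matching row of B
            # With row-major storage this inner loop reads B and writes C
            # at consecutive addresses.
            for j in range(n_cols):
                C[c_row + j] += a * B[b_row + j]
    return C


# A = [[1, 0, 2],
#      [0, 0, 3]]  in CSR form
row_ptr = [0, 2, 3]
col_idx = [0, 2, 2]
vals = [1.0, 2.0, 3.0]
# B = [[1, 2], [3, 4], [5, 6]] stored row-major
B = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

C = csr_spmm(row_ptr, col_idx, vals, B, 2)
print(C)  # [11.0, 14.0, 15.0, 18.0]
```

In a GPU setting, the innermost loop over `j` is the natural dimension to map onto adjacent threads, since consecutive `j` values address consecutive memory locations in both `B` and `C`. This mapping is stated here as a general observation about row-major layouts, not as a description of the paper's exact thread assignment.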
