Efficient Sparse Matrix Kernels based on Adaptive Workload-Balancing and Parallel-Reduction

Sparse matrix-vector and matrix-matrix multiplication (SpMV and SpMM) are fundamental in both conventional (graph analytics, scientific computing) and emerging (sparse DNN, GNN) domains. Workload-balancing and parallel-reduction are widely used design principles for efficient SpMV. However, prior work does not resolve how to implement and adaptively use the two principles for SpMV/SpMM. To overcome this obstacle, we first complete the implementation space by filling in three pieces missing from prior work: (1) We show that workload-balancing and parallel-reduction can be combined through a segment-reduction algorithm implemented with SIMD-shuffle primitives. (2) We show that parallel-reduction can be implemented in SpMM by loading the dense-matrix rows with vector memory operations. (3) We show that vectorized loading of sparse rows, which accounts for part of the benefit of parallel-reduction, can coexist with sequential-reduction in SpMM by temporarily caching sparse-matrix elements in shared memory. In terms of adaptive use, we analyze how the benefits of the two principles change with two characteristics of the input space: the sparsity pattern and the dense-matrix width. We find that the benefit of both principles fades as the total workload, i.e. the dense-matrix width, grows. We also identify, for SpMV and SpMM, different sparse-matrix features that determine how effective workload-balancing is. Our design consistently outperforms cuSPARSE by 1.07-1.57x across different GPUs and dense-matrix widths, and the kernel selection rules incur 5-12% performance loss compared with optimal choices. Our kernel is being integrated into popular graph learning frameworks [1, 2] to accelerate GNN training. (This project is available at https://github.com/hgyhungry/ge-spmm.)

[Figure: the design space spanned by workload balancing and parallel reduction, with CSR-Scalar [Bell09] and CSR-Vector [Bell09] as representative prior kernels.]

1 PROBLEM AND MOTIVATION

Efficient basic sparse-matrix primitives can benefit a variety of applications. The sparse matrix multiplication Y_{MxN} = A_{MxK} X_{KxN}, where A is sparse and X, Y are dense, is referred to as Sparse Matrix-Vector product (SpMV, when N = 1) or Sparse Matrix-Matrix product (SpMM, when N > 1). SpMV and SpMM are fundamental components of a wide range of problem domains. SpMV is used in graph analytics and scientific computing [3, 4]. SpMM is used in iterative algorithms for sparse matrix factorization [5]. Recent advances in sparse NNs, which promise higher computational efficiency than dense models, rely on fast SpMV/SpMM kernels to demonstrate speedup in practice [6]. SpMM is also a core operation in graph neural networks (GNNs) [7, 8]. Accelerating SpMV/SpMM on GPUs, the dominant HPC hardware at present, can therefore boost the performance of many of the aforementioned applications. The sketches below make these kernel-design principles concrete.
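The following kernel sketches are simplified CUDA illustrations added for this summary; they are not the paper's released implementation, and all identifiers (row_ptr, col_idx, vals, x, y, X, Y) and launch parameters are assumptions made for illustration. The first sketch shows the parallel-reduction principle in its classic CSR-Vector form [15]: one warp processes one sparse row and combines the lanes' partial sums with SIMD-shuffle primitives.

// Minimal CSR-Vector-style SpMV sketch (assumed names and launch config).
// Launch example: 128 threads per block, (num_rows * 32 + 127) / 128 blocks.
#include <cuda_runtime.h>

__global__ void spmv_csr_warp_reduce(int num_rows,
                                     const int *__restrict__ row_ptr,
                                     const int *__restrict__ col_idx,
                                     const float *__restrict__ vals,
                                     const float *__restrict__ x,
                                     float *__restrict__ y) {
    const int warp_id = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    const int lane    = threadIdx.x & 31;
    if (warp_id >= num_rows) return;

    // Each of the 32 lanes accumulates a strided slice of the row's nonzeros.
    float sum = 0.f;
    for (int j = row_ptr[warp_id] + lane; j < row_ptr[warp_id + 1]; j += 32)
        sum += vals[j] * x[col_idx[j]];

    // Parallel reduction across the warp with shuffle primitives.
    for (int offset = 16; offset > 0; offset >>= 1)
        sum += __shfl_down_sync(0xffffffff, sum, offset);

    if (lane == 0) y[warp_id] = sum;
}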
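Point (1) combines workload-balancing with parallel-reduction through segment-reduction. The sketch below is a simplified stand-in for that idea, not the paper's exact algorithm: every thread owns exactly one nonzero (perfect balance regardless of row lengths), each thread recovers its row by binary search on row_ptr, and a warp-level segmented reduction built from shuffle primitives sums nonzeros that share a row. Segment tails commit with atomicAdd because a long row may span warps. All identifiers are illustrative.

#include <cuda_runtime.h>

// Binary search: the row owning nonzero nnz_id, i.e. the largest r with
// row_ptr[r] <= nnz_id (also correct when some rows are empty).
__device__ int row_of_nnz(const int *row_ptr, int num_rows, int nnz_id) {
    int lo = 0, hi = num_rows;
    while (lo < hi - 1) {
        int mid = (lo + hi) / 2;
        if (row_ptr[mid] <= nnz_id) lo = mid; else hi = mid;
    }
    return lo;
}

// y must be zero-initialized (e.g. with cudaMemset) before the launch.
__global__ void spmv_csr_segreduce(int num_rows, int nnz,
                                   const int *__restrict__ row_ptr,
                                   const int *__restrict__ col_idx,
                                   const float *__restrict__ vals,
                                   const float *__restrict__ x,
                                   float *__restrict__ y) {
    const int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    const int lane = threadIdx.x & 31;

    // Workload balancing: every thread owns exactly one nonzero. Threads past
    // the end carry a zero contribution so the whole warp can join the shuffles.
    const int   e = min(tid, nnz - 1);
    const float v = (tid < nnz) ? vals[e] * x[col_idx[e]] : 0.f;
    const int   r = row_of_nnz(row_ptr, num_rows, e);

    // Warp-level segmented inclusive scan built from shuffle primitives:
    // partial sums only cross lanes that work on the same row.
    float sum = v;
    for (int off = 1; off < 32; off <<= 1) {
        float up_v = __shfl_up_sync(0xffffffff, sum, off);
        int   up_r = __shfl_up_sync(0xffffffff, r,   off);
        if (lane >= off && up_r == r) sum += up_v;
    }

    // The last lane of each row segment commits its total; atomicAdd is used
    // because a row may be split across warps and blocks.
    const int next_r = __shfl_down_sync(0xffffffff, r, 1);
    if (lane == 31 || next_r != r) atomicAdd(&y[r], sum);
}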
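Point (2) applies parallel-reduction to SpMM by fetching dense-matrix rows with vector memory operations. Below is a minimal sketch of that idea, assuming N is a multiple of 4 and X, Y come from cudaMalloc so that 128-bit loads are aligned; the paper's actual kernel and tiling may differ.

// One warp computes one sparse row times a 4-column slab of X. Each lane
// walks a strided subset of the row's nonzeros, fetching 4 consecutive dense
// elements per nonzero with a single float4 (vector) load; four warp-level
// shuffle reductions then combine the partial sums.
#include <cuda_runtime.h>

__global__ void spmm_csr_parreduce_vec4(int num_rows, int N,
                                        const int *__restrict__ row_ptr,
                                        const int *__restrict__ col_idx,
                                        const float *__restrict__ vals,
                                        const float *__restrict__ X,   // K x N, row-major
                                        float *__restrict__ Y) {       // M x N, row-major
    const int warp_id = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    const int lane    = threadIdx.x & 31;
    const int row     = warp_id / (N / 4);         // sparse row for this warp
    const int slab    = (warp_id % (N / 4)) * 4;   // first of 4 dense columns
    if (row >= num_rows) return;

    float4 acc = make_float4(0.f, 0.f, 0.f, 0.f);
    for (int j = row_ptr[row] + lane; j < row_ptr[row + 1]; j += 32) {
        const float a = vals[j];
        // One 128-bit vector load fetches 4 elements of the dense row.
        const float4 b = *reinterpret_cast<const float4 *>(&X[col_idx[j] * N + slab]);
        acc.x += a * b.x; acc.y += a * b.y; acc.z += a * b.z; acc.w += a * b.w;
    }
    // Parallel reduction of the 4 partial sums across the warp.
    for (int off = 16; off > 0; off >>= 1) {
        acc.x += __shfl_down_sync(0xffffffff, acc.x, off);
        acc.y += __shfl_down_sync(0xffffffff, acc.y, off);
        acc.z += __shfl_down_sync(0xffffffff, acc.z, off);
        acc.w += __shfl_down_sync(0xffffffff, acc.w, off);
    }
    if (lane == 0)
        *reinterpret_cast<float4 *>(&Y[row * N + slab]) = acc;
}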
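Point (3) keeps sequential-reduction for SpMM (each thread accumulates one output element on its own) while still loading the sparse row in a coalesced, vector-like fashion, by temporarily caching chunks of the sparse row in shared memory. The sketch below assumes one warp per sparse row and per 32 dense columns, 256-thread blocks, and a 2-D grid of size ((num_rows + 7) / 8, (N + 31) / 32); identifiers and tiling factors are illustrative.

#include <cuda_runtime.h>

#define WARPS_PER_BLOCK 8   // blockDim.x = WARPS_PER_BLOCK * 32 = 256

__global__ void spmm_csr_seqreduce_shared(int num_rows, int N,
                                          const int *__restrict__ row_ptr,
                                          const int *__restrict__ col_idx,
                                          const float *__restrict__ vals,
                                          const float *__restrict__ X,  // K x N, row-major
                                          float *__restrict__ Y) {      // M x N, row-major
    __shared__ int   sh_col[WARPS_PER_BLOCK * 32];
    __shared__ float sh_val[WARPS_PER_BLOCK * 32];

    const int warp_in_block = threadIdx.x / 32;
    const int lane          = threadIdx.x & 31;
    const int row           = blockIdx.x * WARPS_PER_BLOCK + warp_in_block;
    const int col           = blockIdx.y * 32 + lane;   // dense column owned by this thread
    if (row >= num_rows) return;

    int   *sc = sh_col + warp_in_block * 32;
    float *sv = sh_val + warp_in_block * 32;

    float acc = 0.f;
    const int row_start = row_ptr[row], row_end = row_ptr[row + 1];

    for (int chunk = row_start; chunk < row_end; chunk += 32) {
        // Coalesced load of up to 32 nonzeros of the sparse row into shared memory.
        if (chunk + lane < row_end) {
            sc[lane] = col_idx[chunk + lane];
            sv[lane] = vals[chunk + lane];
        }
        __syncwarp();

        // Sequential reduction over the cached chunk; each thread accumulates
        // its own output element Y[row][col].
        if (col < N) {
            const int limit = min(32, row_end - chunk);
            for (int k = 0; k < limit; ++k)
                acc += sv[k] * X[sc[k] * N + col];
        }
        __syncwarp();
    }
    if (col < N) Y[row * N + col] = acc;
}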

[1] Srinivasan Parthasarathy et al. Efficient sparse-matrix multi-vector product on GPUs. HPDC, 2018.

[2] Lingfan Yu et al. Deep Graph Library: A Graph-Centric, Highly-Performant Package for Graph Neural Networks. 2019.

[3] Erich Elsen et al. Fast Sparse ConvNets. CVPR, 2020.

[4] Yousef Saad. Iterative Methods for Sparse Linear Systems. 2003.

[5] Song Han et al. EIE: Efficient Inference Engine on Compressed Deep Neural Network. ISCA, 2016.

[6] Wei Li et al. Tux2: Distributed Graph Computation for Machine Learning. NSDI, 2017.

[7] Ziheng Wang et al. SparseRT: Accelerating Unstructured Sparsity on GPUs for Deep Learning Inference. PACT, 2020.

[8] Minjie Wang et al. FeatGraph: A Flexible and Efficient Backend for Graph Neural Network Systems. SC, 2020.

[9] P. Sadayappan et al. Adaptive sparse tiling for sparse matrix multiplication. PPoPP, 2019.

[10] Joseph L. Greathouse et al. Efficient Sparse Matrix-Vector Multiplication on GPUs Using the CSR Storage Format. SC, 2014.

[11] Srinivasan Parthasarathy et al. Fast Sparse Matrix-Vector Multiplication on GPUs: Implications for Graph Mining. Proc. VLDB Endowment, 2011.

[12] Timothy A. Davis et al. The University of Florida Sparse Matrix Collection. ACM TOMS, 2011.

[13] Christos Faloutsos et al. R-MAT: A Recursive Model for Graph Mining. SDM, 2004.

[14] Dipankar Das et al. SIGMA: A Sparse and Irregular GEMM Accelerator with Flexible Interconnects for DNN Training. HPCA, 2020.

[15] Michael Garland et al. Implementing sparse matrix-vector multiplication on throughput-oriented processors. SC, 2009.

[16] Michael Garland et al. Merge-Based Parallel Sparse Matrix-Vector Multiplication. SC, 2016.

[17] Alex Smola et al. Deep Graph Library: Towards Efficient and Scalable Deep Learning on Graphs. arXiv, 2019.

[18] John D. Owens et al. Design Principles for Sparse Matrix Multiplication on the GPU. Euro-Par, 2018.

[19] Chang Zhou et al. CogDL: An Extensive Toolkit for Deep Learning on Graphs. arXiv, 2021.

[20] Erich Elsen et al. Sparse GPU Kernels for Deep Learning. SC, 2020.