Distributed-memory parallel algorithms for sparse times tall-skinny-dense matrix multiplication

Sparse times dense matrix multiplication (SpMM) finds applications in well-established fields such as computational linear algebra as well as in emerging fields such as graph neural networks. In this study, we evaluate the performance of various techniques for performing SpMM as a distributed computation across many nodes, with a focus on GPU accelerators. We examine how the local computational performance of state-of-the-art SpMM implementations varies with the matrix dimensions that arise as we scale to large numbers of nodes, and find that it is an unexpectedly important bottleneck for overall efficiency. We also consider various distribution strategies, including A-Stationary, B-Stationary, and C-Stationary algorithms, 1.5D and 2D algorithms, and RDMA-based and bulk synchronous methods of data transfer. Our results show that the best choice of algorithm and implementation technique depends not only on the cost of communication for particular matrix sizes and dimensions, but also on the performance of local SpMM operations. Our evaluations reveal that when GPU accelerators are involved, the best design choices for SpMM differ from the conventional algorithms that are known to perform well for dense matrix-matrix or sparse matrix-sparse matrix multiplication.
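To make the distribution strategies concrete, below is a minimal sketch of one of the simplest variants mentioned above: a bulk-synchronous, C-Stationary 1D algorithm in which each rank keeps its block row of the output C in place while the dense block rows of B rotate around a ring. This is an illustrative CPU sketch using mpi4py and scipy.sparse rather than the paper's GPU kernels; the dimensions, density, and all variable names are assumptions chosen for demonstration.

```python
# Hypothetical sketch: bulk-synchronous C-Stationary 1D SpMM (C = A @ B).
# Rank i owns block row A_i of the sparse matrix (nb x n, CSR), block row B_i
# of the tall-skinny dense matrix (nb x k), and accumulates block row C_i in
# place while the B blocks circulate around a ring.
from mpi4py import MPI
import numpy as np
import scipy.sparse as sp

comm = MPI.COMM_WORLD
p, rank = comm.Get_size(), comm.Get_rank()

# Illustrative sizes (assumptions); n must be divisible by p in this sketch.
n, k, density = 8192, 64, 1e-3
nb = n // p  # height of each block row

rng = np.random.default_rng(seed=rank)
A_local = sp.random(nb, n, density=density, format="csr", random_state=rng)
B_local = rng.standard_normal((nb, k))
C_local = np.zeros((nb, k))  # stays stationary on this rank

buf = B_local.copy()
for step in range(p):
    owner = (rank + step) % p                  # which block row of B we hold now
    cols = slice(owner * nb, (owner + 1) * nb)
    C_local += A_local[:, cols] @ buf          # local SpMM on the matching column block
    if step < p - 1:                           # bulk-synchronous ring shift of the dense block
        buf = comm.sendrecv(buf, dest=(rank - 1) % p, source=(rank + 1) % p)
```

Run with, e.g., `mpiexec -n 4 python spmm_ring.py` (a hypothetical filename). The A-Stationary and B-Stationary variants instead keep the sparse or dense input in place and communicate the other operand or reduce partial C results, and an RDMA-based variant would replace the blocking `sendrecv` ring shift with one-sided transfers; the GPU setting studied in the paper also replaces the local `@` with a tuned device SpMM kernel.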
