Exploiting Multiple Levels of Parallelism in Sparse Matrix-Matrix Multiplication

Sparse matrix-matrix multiplication (SpGEMM) is a key primitive for many high-performance graph algorithms as well as for some linear solvers, such as algebraic multigrid. The scaling of existing parallel implementations of SpGEMM is heavily bound by communication. Even though 3D (or 2.5D) algorithms have been proposed and theoretically analyzed in the flat MPI model on Erdős-Rényi matrices, those algorithms had not been implemented in practice and their complexities had not been analyzed for the general case. In this work, we present the first implementation of the 3D SpGEMM formulation that also exploits multiple (intra-node and inter-node) levels of parallelism, achieving significant speedups over the state-of-the-art publicly available codes at all levels of concurrency. We extensively evaluate our implementation and identify bottlenecks that should be subject to further research.
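As a rough illustration of the primitive, the following sequential sketch computes a sparse product row by row with a hash-based accumulator, then repeats the computation with the inner dimension split into "layers" whose partial products are summed. Splitting the inner dimension is the general idea usually associated with the third dimension of 3D formulations; it is shown here only as an assumption-laden toy, not as the implementation described in this work. The dict-of-dicts storage and the names `spgemm`, `split_inner`, `add_sparse`, and `spgemm_layered` are hypothetical choices made for the example.

```python
from collections import defaultdict


def spgemm(A, B):
    """Row-wise sparse product C = A * B.

    A and B map row index -> {column index: value}; only nonzeros are stored.
    """
    C = {}
    for i, row_a in A.items():
        acc = defaultdict(float)  # sparse accumulator for row i of C
        for k, a_ik in row_a.items():
            for j, b_kj in B.get(k, {}).items():
                acc[j] += a_ik * b_kj
        if acc:
            C[i] = dict(acc)
    return C


def split_inner(A, B, layers):
    """Partition the inner dimension k round-robin into `layers` pieces.

    Piece l keeps the columns of A and the rows of B with k % layers == l.
    """
    parts = []
    for l in range(layers):
        A_l = {}
        for i, row in A.items():
            kept = {k: v for k, v in row.items() if k % layers == l}
            if kept:
                A_l[i] = kept
        B_l = {k: row for k, row in B.items() if k % layers == l}
        parts.append((A_l, B_l))
    return parts


def add_sparse(C, D):
    """Return the entrywise sum of two sparse matrices."""
    out = {i: dict(row) for i, row in C.items()}
    for i, row in D.items():
        target = out.setdefault(i, {})
        for j, v in row.items():
            target[j] = target.get(j, 0.0) + v
    return out


def spgemm_layered(A, B, layers):
    """Multiply each inner-dimension piece independently, then sum the results.

    In a distributed setting each piece would be handled by a different group
    of processes; here the pieces are simply processed one after another.
    """
    C = {}
    for A_l, B_l in split_inner(A, B, layers):
        C = add_sparse(C, spgemm(A_l, B_l))
    return C


A = {0: {0: 1.0, 1: 2.0}, 1: {1: 3.0}}
B = {0: {0: 4.0}, 1: {0: 5.0, 1: 6.0}}
C_direct = spgemm(A, B)
C_layered = spgemm_layered(A, B, layers=2)
```

Both calls produce the same matrix on this small input, since the product is a sum over the inner index and that sum can be regrouped freely. The sketch says nothing about communication cost, which is the actual bottleneck discussed above.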
