Reducing Communication Costs for Sparse Matrix Multiplication within Algebraic Multigrid

We consider the sequence of sparse matrix-matrix multiplications performed during the setup phase of algebraic multigrid. In particular, we show that the most commonly used parallel algorithm is often not the most communication-efficient choice for every multiplication in that sequence. Using an alternative algorithm, we reduce communication costs both in theory and in practice, and we demonstrate the performance benefit for both model (structured) and more realistic unstructured problems on large-scale distributed-memory parallel systems. Our theoretical analysis shows that communication can be reduced by a factor of up to 5.4 for a model problem, and our empirical evaluation shows communication reductions by factors of up to 4.7 for structured problems and 3.7 for unstructured problems. These reductions translate to run-time speedups by factors of up to 2.8 and 2.5, respectively.
