Red-blue pebbling revisited: near optimal parallel matrix-matrix multiplication

We propose COSMA: a parallel matrix-matrix multiplication algorithm that is near communication-optimal for all combinations of matrix dimensions, processor counts, and memory sizes. The key idea behind COSMA is to derive an optimal sequential schedule (within a factor of 0.03% of optimal for 10 MB of fast memory) and then parallelize it while preserving I/O optimality. To achieve this, we use the red-blue pebble game to precisely model MMM dependencies and derive constructive and tight sequential and parallel I/O lower bounds. Compared to 2D or 3D algorithms, which fix the processor decomposition upfront and then map it to the matrix dimensions, COSMA reduces communication volume by up to √3 times. It outperforms the established ScaLAPACK, CARMA, and CTF algorithms in all scenarios, by up to 12.8x (2.2x on average), achieving up to 88% of Piz Daint's peak performance. Our work requires no hand tuning and is maintained as an open-source implementation.
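The I/O accounting behind these claims can be illustrated with a small sketch. This is not COSMA's schedule or its exact tight bound; it assumes the classical Hong–Kung-style sequential lower bound Q ≥ 2mnk/√S for multiplying an m×k by a k×n matrix with fast memory of size S, and compares it against a naive square-tiled schedule that keeps three b×b tiles resident. Function names and the tiling choice are illustrative only; the gap between the two leading terms is exactly the √3 factor that a schedule fixing its tile shape upfront can lose.

```python
from math import isqrt, sqrt

def mmm_io_lower_bound(m, n, k, S):
    # Hong-Kung-style sequential I/O lower bound for C = A @ B
    # with a fast memory holding S words (leading term only;
    # COSMA's tight bound refines the additive terms).
    return 2 * m * n * k / sqrt(S)

def square_tiled_mmm_io(m, n, k, S):
    # I/O volume of a naive schedule using three resident b x b tiles
    # (one each for A, B, C): every C tile is loaded/stored once (m*n words),
    # and A/B tiles are streamed 2*m*n*k/b words in total.
    b = isqrt(S // 3)  # largest b such that 3*b*b <= S
    return 2 * m * n * k // b + m * n

# Example: 1024^3 multiplication, fast memory of 12288 words.
S = 3 * 64 * 64
lb = mmm_io_lower_bound(1024, 1024, 1024, S)
actual = square_tiled_mmm_io(1024, 1024, 1024, S)
# The tiled schedule is a small constant factor (~sqrt(3)) above the bound.
```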
