Paper information - Red-blue pebbling revisited: near optimal parallel matrix-matrix multiplication

Red-blue pebbling revisited: near optimal parallel matrix-matrix multiplication

We propose COSMA: a parallel matrix-matrix multiplication (MMM) algorithm that is near communication-optimal for all combinations of matrix dimensions, processor counts, and memory sizes. The key idea behind COSMA is to derive an optimal (up to a factor of 0.03% for 10 MB of fast memory) sequential schedule and then parallelize it, preserving I/O optimality. To achieve this, we use the red-blue pebble game to precisely model MMM dependencies and derive constructive and tight sequential and parallel I/O lower bound proofs. Compared to 2D or 3D algorithms, which fix the processor decomposition upfront and then map it to the matrix dimensions, COSMA reduces communication volume by up to √3 times. COSMA outperforms the established ScaLAPACK, CARMA, and CTF algorithms in all scenarios by up to 12.8x (2.2x on average), achieving up to 88% of Piz Daint's peak performance. Our work does not require any hand tuning and is maintained as an open source implementation.
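The abstract starts from a sequential schedule that minimizes traffic between a small fast memory and a large slow memory. The sketch below is not COSMA's schedule or code. It only counts the loads and stores of a simple tiled schedule in a two-level memory model, to illustrate why reusing a resident tile of the output reduces I/O. The function name `tiled_mmm_io`, the tile-size rule, and the cost model (one unit per element moved, output tile initialized in fast memory) are assumptions made for this illustration.

```python
import math


def tiled_mmm_io(m, n, k, S):
    """Count element loads and stores for C = A @ B with A (m x k), B (k x n).

    Assumed model: fast memory holds S elements. An a x a tile of C stays
    resident while one column slice of A (a elements) and one row slice
    of B (a elements) are streamed in for each of the k update steps.
    """
    if S < 3:
        raise ValueError("fast memory must hold at least 3 elements")
    # Largest a with a*a + 2*a <= S, i.e. (a + 1)^2 <= S + 1.
    a = math.isqrt(S + 1) - 1
    loads = 0
    stores = 0
    for i0 in range(0, m, a):
        ti = min(a, m - i0)          # tile height (smaller at the edge)
        for j0 in range(0, n, a):
            tj = min(a, n - j0)      # tile width (smaller at the edge)
            loads += k * (ti + tj)   # stream A and B slices for k steps
            stores += ti * tj        # write the finished C tile once
    return loads, stores


def naive_mmm_loads(m, n, k):
    """Loads when every multiply-add fetches both operands from slow memory."""
    return 2 * m * n * k


if __name__ == "__main__":
    print(tiled_mmm_io(4, 4, 4, 8))   # tile side a = 2
    print(naive_mmm_loads(4, 4, 4))
```

In this model the load count scales roughly as 2mnk/a with a close to √S, so a larger fast memory allows larger tiles and proportionally less traffic. The paper's contribution is a precise lower bound and a matching schedule, which this sketch does not attempt to reproduce.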
