Communication-optimal parallel algorithm for Strassen's matrix multiplication

Parallel matrix multiplication is one of the most studied fundamental problems in distributed and high-performance computing. We obtain a new parallel algorithm that is based on Strassen's fast matrix multiplication and minimizes communication. The algorithm outperforms all known parallel matrix multiplication algorithms, classical and Strassen-based, both asymptotically and in practice. A critical bottleneck in parallelizing Strassen's algorithm is the communication between the processors. Ballard, Demmel, Holtz, and Schwartz (SPAA '11) prove lower bounds on these communication costs, using expansion properties of the underlying computation graph. Our algorithm matches these lower bounds, and so is communication-optimal. It exhibits perfect strong scaling within the maximum possible range. Benchmarking our implementation on a Cray XT4, we obtain speedups over classical and Strassen-based algorithms ranging from 24% to 184% for a fixed matrix dimension n = 94080, where the number of processors ranges from 49 to 7203. Our parallelization approach generalizes to other fast matrix multiplication algorithms.
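
The lower bounds being matched can be stated compactly. The following is a hedged restatement in the notation commonly used in [21] and [28], not a verbatim quote: for Strassen-based multiplication of n-by-n matrices on P processors, each with a local memory of M words, the bandwidth cost W (words communicated along the critical path) satisfies

    W = \Omega\!\left( \left( \frac{n}{\sqrt{M}} \right)^{\omega_0} \cdot \frac{M}{P} \right)
    \quad\text{and}\quad
    W = \Omega\!\left( \frac{n^2}{P^{2/\omega_0}} \right),
    \qquad \omega_0 = \log_2 7 \approx 2.81.

The two bounds coincide at M = Θ(n²/P^(2/ω₀)), which is why perfect strong scaling can hold only up to P = O(n^(ω₀)/M^(ω₀/2)), the "maximum possible range" referred to above.

For readers who want to see the recursion the paper parallelizes, the sketch below implements Strassen's seven-product scheme [26] sequentially in Python/NumPy. It is illustrative only, not the paper's parallel implementation; the even-dimension handling and the cutoff value are simplifying assumptions made here.

```python
import numpy as np

def strassen(A, B, cutoff=128):
    """Strassen's recursive matrix multiplication (sequential sketch).

    Assumes A and B are square with equal shape; falls back to the
    classical product below `cutoff` or when the dimension is odd.
    Seven recursive half-size products, instead of the classical
    eight, give the O(n^(log2 7)) flop count the paper builds on.
    """
    n = A.shape[0]
    if n <= cutoff or n % 2 != 0:
        return A @ B                      # classical base case
    h = n // 2
    A11, A12 = A[:h, :h], A[:h, h:]
    A21, A22 = A[h:, :h], A[h:, h:]
    B11, B12 = B[:h, :h], B[:h, h:]
    B21, B22 = B[h:, :h], B[h:, h:]
    # Strassen's seven products [26].
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11,       cutoff)
    M3 = strassen(A11,       B12 - B22, cutoff)
    M4 = strassen(A22,       B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22,       cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    # Recombine the quadrants of C = A @ B.
    C = np.empty((n, n), dtype=A.dtype)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C

# Quick check against NumPy's classical product:
A = np.random.rand(256, 256)
B = np.random.rand(256, 256)
assert np.allclose(strassen(A, B), A @ B)
```

Roughly speaking, the paper's parallel algorithm distributes this recursion tree over the processors, interleaving steps in which the seven subproblems proceed in parallel on disjoint processor subsets with steps in which all processors cooperate on one subproblem at a time, trading local memory for communication to meet the bounds above.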

[1]  Qingshan Luo et al. A scalable parallel Strassen's matrix multiplication algorithm for distributed-memory computers. SAC '95, 1995.

[2]  Dario Bini. Relations between exact and approximate bilinear algorithms. Applications. Calcolo, 1980.

[3]  Frédéric Suter et al. Impact of mixed-parallelism on parallel implementations of the Strassen and Winograd matrix multiplication algorithms. Concurr. Pract. Exp., 2004.

[4]  Robert L. Probert. On the Additive Complexity of Matrix Multiplication. SIAM J. Comput., 1976.

[5]  Robert A. van de Geijn et al. A High Performance Parallel Strassen Implementation. Parallel Process. Lett., 1995.

[6]  Ramesh C. Agarwal et al. A three-dimensional approach to parallel matrix multiplication. IBM J. Res. Dev., 1995.

[7]  Nader H. Bshouty. On the Additive Complexity of 2 × 2 Matrix Multiplication. Inf. Process. Lett., 1995.

[8]  Jaeyoung Choi et al. PUMMA: Parallel universal matrix multiplication algorithms on distributed memory concurrent computers. Concurr. Pract. Exp., 1994.

[9]  Samuel H. Fuller and Lynette I. Millett, editors. The Future of Computing Performance: Game Over or Next Level? The National Academies Press, 2011.

[10]  Leslie G. Valiant. A bridging model for parallel computation. CACM, 1990.

[11]  Bharat Kumar et al. A tensor product formulation of Strassen's matrix multiplication algorithm with memory reduction. 1995.

[12]  James Demmel et al. Graph expansion and communication costs of fast matrix multiplication. J. ACM, 2013.

[13]  Nicholas J. Higham. Accuracy and Stability of Numerical Algorithms. SIAM, 2nd edition, 2002.

[14]  Jarle Berntsen. Communication efficient matrix multiplication on hypercubes. Parallel Comput., 1989.

[15]  Thomas Rauber et al. Combining building blocks for parallel multi-level matrix multiplication. Parallel Comput., 2008.

[16]  Alok Aggarwal et al. Communication Complexity of PRAMs. Theor. Comput. Sci., 1990.

[17]  Lynn Elliot Cannon. A cellular computer to implement the Kalman filter algorithm. PhD thesis, Montana State University, 1969.

[18]  Victor Y. Pan. New Fast Algorithms for Matrix Operations. SIAM J. Comput., 1980.

[19]  Henry Cohn, Robert Kleinberg, Balázs Szegedy, and Christopher Umans. Group-theoretic algorithms for matrix multiplication. FOCS '05, 2005.

[20]  Jaeyoung Choi. A new parallel matrix multiplication algorithm on distributed-memory concurrent computers. Concurr. Pract. Exp., 1998.

[21]  James Demmel et al. Graph expansion and communication costs of fast matrix multiplication. SPAA '11, 2011.

[22]  Barton P. Miller et al. Critical path analysis for the execution of parallel and distributed programs. Proceedings of the 8th International Conference on Distributed Computing Systems, 1988.

[23]  Jack Dongarra et al. ScaLAPACK Users' Guide. SIAM, 1997.

[24]  Arnold Schönhage. Partial and Total Matrix Multiplication. SIAM J. Comput., 1981.

[25]  Jack Dongarra et al. Experiments with Strassen's Algorithm: From Sequential to Parallel. 2006.

[26]  V. Strassen. Gaussian elimination is not optimal. Numerische Mathematik, 1969.

[27]  Susan L. Graham, Marc Snir, and Cynthia A. Patterson, editors. Getting Up to Speed: The Future of Supercomputing. The National Academies Press, 2005.

[28]  James Demmel et al. Brief announcement: strong scaling of matrix multiplication algorithms and memory-independent communication lower bounds. SPAA '12, 2012.

[29]  James Demmel et al. Fast linear algebra is stable. Numerische Mathematik, 2007.

[30]  Ran Raz. On the complexity of matrix product. STOC '02, 2002.

[31]  Don Coppersmith et al. Matrix multiplication via arithmetic progressions. STOC '87, 1987.

[32]  Robert A. van de Geijn et al. SUMMA: scalable universal matrix multiplication algorithm. Concurr. Pract. Exp., 1995.

[33]  James Demmel et al. Minimizing Communication in Numerical Linear Algebra. SIAM J. Matrix Anal. Appl., 2011.

[34]  Don Coppersmith et al. On the Asymptotic Complexity of Matrix Multiplication. SIAM J. Comput., 1982.

[35]  V. Strassen. Relative bilinear complexity and matrix multiplication. J. Reine Angew. Math., 1987.

[36]  James Demmel et al. Communication-Optimal Parallel 2.5D Matrix Multiplication and LU Factorization Algorithms. Euro-Par, 2011.

[37]  Dror Irony et al. Communication lower bounds for distributed-memory matrix multiplication. J. Parallel Distributed Comput., 2004.

[38]  Shmuel Winograd. On multiplication of 2 × 2 matrices. Linear Algebra Appl., 1971.

[40]  Francesco Romani. Some Properties of Disjoint Sums of Tensors Related to Matrix Multiplication. SIAM J. Comput., 1982.

[41]  James Demmel et al. Fast matrix multiplication is stable. Numerische Mathematik, 2007.
