Communication-Avoiding Parallel Recursive Algorithms for Matrix Multiplication

Abstract: Matrix multiplication is one of the most fundamental algorithmic problems in numerical linear algebra, distributed computing, scientific computing, and high-performance computing. Its parallelization has been studied extensively (e.g., [21, 12, 24, 2, 51, 39, 36, 23, 45, 61]) using a wide range of theoretical approaches, algorithmic tools, and software engineering methods to optimize performance and obtain faster, more efficient parallel algorithms and implementations. Designing an efficient parallel algorithm requires not only load balancing the computation but also minimizing the time spent communicating between processors. Interprocessor communication costs are in many cases significantly higher than the computational costs, and hardware trends predict that more problems will become communication-bound in the future [38, 35]. Even matrix multiplication, which is widely considered computation-bound, becomes communication-bound when a given problem is run on sufficiently many processors.
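To make the last claim concrete, here is a rough cost sketch, assuming the classical (non-Strassen) algorithm and the distributed-memory communication lower bound of Irony, Toledo, and Tiskin cited in the references. The symbols n (matrix dimension), P (number of processors), M (words of local memory per processor), F (flops per processor), and W (words communicated per processor) are introduced here only for illustration.

% Rough per-processor cost model for classical n-by-n matrix multiplication
% on P processors (LaTeX snippet; requires amsmath).
\begin{align*}
  F &= \Theta\!\left(\frac{n^{3}}{P}\right)
      && \text{arithmetic, assuming a load-balanced classical algorithm,} \\
  W &= \Omega\!\left(\frac{n^{3}}{P\sqrt{M}}\right)
      && \text{words moved, by the Irony--Toledo--Tiskin lower bound.} \\
\intertext{With the minimal memory of a 2D algorithm, $M = \Theta(n^{2}/P)$, this becomes}
  W &= \Omega\!\left(\frac{n^{2}}{\sqrt{P}}\right),
  \qquad
  \frac{F}{W} = O\!\left(\frac{n}{\sqrt{P}}\right).
\end{align*}
% For fixed n, the ratio F/W shrinks as P grows, so on sufficiently many
% processors the runtime is dominated by communication rather than arithmetic,
% which is the regime that communication-avoiding algorithms target.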

[1] James Demmel, et al. Minimizing Communication in Numerical Linear Algebra, 2009, SIAM J. Matrix Anal. Appl.

[2] Jehoshua Bruck, et al. Efficient algorithms for all-to-all communications in multi-port message-passing systems, 1994, SPAA '94.

[3] Keshav Pingali, et al. Automatic Generation of Block-Recursive Codes, 2000, Euro-Par.

[4] John R. Gilbert, et al. The Combinatorial BLAS: design, implementation, and applications, 2011, Int. J. High Perform. Comput. Appl.

[5] Robert L. Probert. On the Additive Complexity of Matrix Multiplication, 1976, SIAM J. Comput.

[6] Alexander Tiskin. Communication-efficient parallel generic pairwise elimination, 2007, Future Gener. Comput. Syst.

[7] Michael Clausen, et al. Algebraic complexity theory, 1997, Grundlehren der mathematischen Wissenschaften.

[8] Robert A. van de Geijn, et al. SUMMA: Scalable Universal Matrix Multiplication Algorithm, 1995.

[9] Dror Irony, et al. Communication lower bounds for distributed-memory matrix multiplication, 2004, J. Parallel Distributed Comput.

[10] Victor Y. Pan, et al. Fast rectangular matrix multiplications and improving parallel matrix computations, 1997, PASCO '97.

[11] Vijaya Ramachandran, et al. Oblivious algorithms for multicores and network of processors, 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS).

[12] Nader H. Bshouty, et al. On the Additive Complexity of 2 × 2 Matrix Multiplication, 1995, Inf. Process. Lett.

[13] Shmuel Winograd, et al. On multiplication of 2 × 2 matrices, 1971.

[14] James Demmel, et al. Communication-optimal parallel algorithm for Strassen's matrix multiplication, 2012, SPAA '12.

[15] Jaeyoung Choi, et al. PUMMA: Parallel universal matrix multiplication algorithms on distributed memory concurrent computers, 1994, Concurr. Pract. Exp.

[16] James Demmel, et al. Communication-Optimal Parallel 2.5D Matrix Multiplication and LU Factorization Algorithms, 2011, Euro-Par.

[17] Victor Y. Pan, et al. Fast Rectangular Matrix Multiplication and Applications, 1998, J. Complex.

[18] James Demmel, et al. Communication optimal parallel multiplication of sparse random matrices, 2013, SPAA.

[19] James Demmel, et al. Communication-Avoiding Parallel Strassen: Implementation and performance, 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.

[20] James Demmel, et al. Graph expansion and communication costs of fast matrix multiplication: regular submission, 2011, SPAA '11.

[21] V. Strassen. Gaussian elimination is not optimal, 1969.

[22] Robert A. van de Geijn, et al. A flexible class of parallel matrix multiplication algorithms, 1998, Proceedings of the First Merged International Parallel Processing Symposium and Symposium on Parallel and Distributed Processing.

[23] Grazia Lotti, et al. O(n^2.7799) Complexity for n × n Approximate Matrix Multiplication, 1979, Inf. Process. Lett.

[24] Jaeyoung Choi, et al. A new parallel matrix multiplication algorithm on distributed-memory concurrent computers, 1998.

[25] Qingshan Luo, et al. A scalable parallel Strassen's matrix multiplication algorithm for distributed-memory computers, 1995, SAC '95.

[26] Lynn Elliot Cannon. A cellular computer to implement the Kalman filter algorithm, 1969.

[27] Grazia Lotti, et al. On the Asymptotic Complexity of Rectangular Matrix Multiplication, 1983, Theor. Comput. Sci.

[28] M. Challacombe. A general parallel sparse-blocked matrix multiply for linear scaling SCF theory, 2000.

[29] Robert A. van de Geijn, et al. A High Performance Parallel Strassen Implementation, 1995, Parallel Process. Lett.

[30] Marc Snir, et al. Getting Up to Speed: The Future of Supercomputing, 2004.

[31] James Demmel, et al. Communication-optimal Parallel and Sequential Cholesky Decomposition, 2009, SIAM J. Sci. Comput.

[32] H. Whitney, et al. An inequality related to the isoperimetric inequality, 1949.

[33] James Demmel, et al. Fast matrix multiplication is stable, 2006, Numerische Mathematik.

[34] D. Coppersmith. Rapid multiplication of rectangular matrices, 2014.

[35] Alexander Tiskin, et al. All-Pairs Shortest Paths Computation in the BSP Model, 2001, ICALP.

[36] James Demmel, et al. Perfect Strong Scaling Using No Additional Energy, 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.

[37] Jaeyoung Choi. A new parallel matrix multiplication algorithm on distributed-memory concurrent computers, 1998, Concurr. Pract. Exp.

[38] Katherine A. Yelick, et al. Communication avoiding and overlapping for numerical linear algebra, 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.

[39] Don Coppersmith, et al. Rectangular Matrix Multiplication Revisited, 1997, J. Complex.

[40] Alexander Tiskin, et al. Memory-Efficient Matrix Multiplication in the BSP Model, 1999, Algorithmica.

[41] Nicholas J. Higham, et al. Inverse Problems Newsletter, 1991.

[42] Katherine A. Yelick, et al. A Communication-Optimal N-Body Algorithm for Direct Interactions, 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.

[43] James Demmel, et al. Brief announcement: strong scaling of matrix multiplication algorithms and memory-independent communication lower bounds, 2012, SPAA '12.

[44] James Demmel, et al. Fast linear algebra is stable, 2006, Numerische Mathematik.

[45] John Shalf, et al. Exascale Computing Technology Challenges, 2010, VECPAR.

[46] Kyriakos Kalorkoti. Algebraic Complexity Theory (Grundlehren der Mathematischen Wissenschaften 315), 1999.

[47] James Demmel, et al. Communication efficient Gaussian elimination with partial pivoting using a shape morphing data layout, 2013, SPAA.

[48] John R. Gilbert, et al. Parallel Sparse Matrix-Matrix Multiplication and Indexing: Implementation and Experiments, 2011, SIAM J. Sci. Comput.

[49] Raphael Yuster, et al. Fast sparse matrix multiplication, 2004, TALG.

[50] John R. Gilbert, et al. Challenges and Advances in Parallel Sparse Matrix-Matrix Multiplication, 2008, 2008 37th International Conference on Parallel Processing.

[51] Barton P. Miller, et al. Critical path analysis for the execution of parallel and distributed programs, 1988, Proceedings of the 8th International Conference on Distributed Computing Systems.

[52] L. R. Kerr. On Minimizing the Number of Multiplications Necessary for Matrix Multiplication, 1969.

[53] Hans Werner Meuer, et al. Top500 Supercomputer Sites, 1997.

[54] Eli Upfal, et al. Efficient Algorithms for All-to-All Communications in Multiport Message-Passing Systems, 1997, IEEE Trans. Parallel Distributed Syst.

[55] Larry Rudolph, et al. Techniques for Parallel Manipulation of Sparse Matrices, 1989, Theor. Comput. Sci.

[56] James Demmel, et al. Graph Expansion Analysis for Communication Costs of Fast Rectangular Matrix Multiplication, 2012, MedAlg.

[57] Ramesh C. Agarwal, et al. A three-dimensional approach to parallel matrix multiplication, 1995, IBM J. Res. Dev.

[58] P. Sadayappan, et al. Communication-Efficient Matrix Multiplication on Hypercubes, 1996, Parallel Comput.

[59] Sivan Toledo, et al. Communication lower bounds for distributed-memory matrix multiplication, 2004.

[60] James Demmel, et al. Communication-Optimal Parallel Recursive Rectangular Matrix Multiplication, 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.

[61] Julian D. Laderman, et al. On practical algorithms for accelerated matrix multiplication, 1992.

[62] Volker Strassen, et al. Algebraic Complexity Theory, 1991, Handbook of Theoretical Computer Science, Volume A: Algorithms and Complexity.

[63] Geppino Pucci, et al. Network-Oblivious Algorithms, 2007, 2007 IEEE International Parallel and Distributed Processing Symposium.

[64] Charles E. Leiserson, et al. Cache-Oblivious Algorithms, 2003, CIAC.

[65] Viktor K. Prasanna, et al. Optimizing graph algorithms for improved cache performance, 2002, IEEE Transactions on Parallel and Distributed Systems.

[66] Nedjeljko Frančula. The National Academies Press, 2013.

[67] David S. Wise, et al. Seven at one stroke: results from a cache-oblivious paradigm for scalable matrix algorithms, 2006, MSPC '06.

[68] James Demmel, et al. Minimizing Communication in All-Pairs Shortest Paths, 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.

[69] Robert A. van de Geijn, et al. SUMMA: scalable universal matrix multiplication algorithm, 1995, Concurr. Pract. Exp.

[70] Samuel H. Fuller, et al. The Future of Computing Performance: Game Over or Next Level?, 2014.