Communication-Optimal Parallel Recursive Rectangular Matrix Multiplication

Communication-optimal algorithms are known for square matrix multiplication. Here, we obtain the first communication-optimal algorithm for rectangular matrices of all dimensions. Combining the dimension-splitting technique of Frigo, Leiserson, Prokop and Ramachandran (1999) with the recursive BFS/DFS approach of Ballard, Demmel, Holtz, Lipshitz and Schwartz (2012) yields an algorithm that is communication-optimal as well as cache- and network-oblivious. Moreover, the implementation is simple: approximately 50 lines of code for the shared-memory version. Since the new algorithm minimizes communication across the network, between NUMA domains, and between levels of cache, it performs well in practice on both shared- and distributed-memory machines. We show significant speedups over existing parallel linear algebra libraries both on a 32-core shared-memory machine and on a distributed-memory supercomputer.
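The dimension-splitting recursion of Frigo et al. that the abstract refers to can be illustrated sequentially: at each step, halve whichever of the three dimensions (m, k, n) is largest and recurse on the two halves. The sketch below shows only this splitting structure; the parallel BFS/DFS scheduling from Ballard et al. and the paper's 50-line shared-memory implementation are not reproduced here, and the `BASE` cutoff is an arbitrary illustrative choice, not a value from the paper.

```python
BASE = 4  # recursion cutoff (illustrative, not from the paper)

def matmul(A, B):
    """Return C = A @ B for dense lists-of-lists, splitting the largest dimension."""
    m, k, n = len(A), len(B), len(B[0])
    if max(m, k, n) <= BASE:
        # Base case: classical triple loop.
        return [[sum(A[i][p] * B[p][j] for p in range(k))
                 for j in range(n)] for i in range(m)]
    if m >= k and m >= n:
        # Split the rows of A: C is the vertical stack of the two half-products.
        h = m // 2
        return matmul(A[:h], B) + matmul(A[h:], B)
    if n >= k:
        # Split the columns of B: C is the horizontal stack of the two half-products.
        h = n // 2
        C1 = matmul(A, [row[:h] for row in B])
        C2 = matmul(A, [row[h:] for row in B])
        return [r1 + r2 for r1, r2 in zip(C1, C2)]
    # Split the shared dimension k: the two partial products are summed elementwise.
    h = k // 2
    C1 = matmul([row[:h] for row in A], B[:h])
    C2 = matmul([row[h:] for row in A], B[h:])
    return [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(C1, C2)]
```

In the parallel setting, each split is executed either as a BFS step (both subproblems solved simultaneously on disjoint processor halves) or a DFS step (solved in sequence on all processors), which is what makes the algorithm communication-optimal across the full range of rectangular shapes.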

[1]  Ramesh C. Agarwal,et al.  A three-dimensional approach to parallel matrix multiplication , 1995, IBM J. Res. Dev..

[2]  Robert A. van de Geijn,et al.  SUMMA: Scalable Universal Matrix Multiplication Algorithm , 1995, Concurr. Pract. Exp..

[3]  Dror Irony,et al.  Communication lower bounds for distributed-memory matrix multiplication , 2004, J. Parallel Distributed Comput..

[4]  H. Whitney,et al.  An inequality related to the isoperimetric inequality , 1949 .

[5]  Gang Ren,et al.  Is Search Really Necessary to Generate High-Performance BLAS? , 2005, Proceedings of the IEEE.

[6]  James Demmel,et al.  Minimizing Communication in Numerical Linear Algebra , 2009, SIAM J. Matrix Anal. Appl..

[7]  Vijaya Ramachandran,et al.  Oblivious algorithms for multicores and network of processors , 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS).

[8]  James Demmel,et al.  Communication-optimal parallel algorithm for strassen's matrix multiplication , 2012, SPAA '12.

[9]  James Demmel,et al.  Communication-Avoiding Parallel Strassen: Implementation and performance , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.

[10]  Don Coppersmith,et al.  Rectangular Matrix Multiplication Revisited , 1997, J. Complex..

[11]  Volker Strassen,et al.  Algebraic Complexity Theory , 1991, Handbook of Theoretical Computer Science, Volume A: Algorithms and Complexity.

[12]  Alexander Tiskin,et al.  Memory-Efficient Matrix Multiplication in the BSP Model , 1999, Algorithmica.

[13]  John Shalf,et al.  SEJITS: Getting Productivity and Performance With Selective Embedded JIT Specialization , 2010 .

[14]  Michael Clausen,et al.  Algebraic complexity theory , 1997, Grundlehren der mathematischen Wissenschaften.

[15]  Victor Y. Pan,et al.  Fast Rectangular Matrix Multiplication and Applications , 1998, J. Complex..

[16]  Geppino Pucci,et al.  Network-Oblivious Algorithms , 2007, IPDPS.

[17]  H. T. Kung,et al.  I/O complexity: The red-blue pebble game , 1981, STOC '81.

[18]  Grazia Lotti,et al.  O(n^2.7799) Complexity for n×n Approximate Matrix Multiplication , 1979, Inf. Process. Lett..

[19]  Erik Elmroth,et al.  Recursive Blocked Algorithms and Hybrid Data Structures for Dense Matrix Library Software , 2004, SIAM Rev..

[20]  James Demmel,et al.  Improving communication performance in dense linear algebra via topology aware collectives , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[21]  Jack J. Dongarra,et al.  A Portable Programming Interface for Performance Evaluation on Modern Processors , 2000, Int. J. High Perform. Comput. Appl..

[22]  Keshav Pingali,et al.  An experimental comparison of cache-oblivious and cache-conscious programs , 2007, SPAA '07.

[23]  D. Coppersmith  Rapid Multiplication of Rectangular Matrices , 1982, SIAM J. Comput..

[25]  L. R. Kerr,et al.  On Minimizing the Number of Multiplications Necessary for Matrix Multiplication , 1969 .

[26]  Grazia Lotti,et al.  On the Asymptotic Complexity of Rectangular Matrix Multiplication , 1983, Theor. Comput. Sci..

[27]  Matteo Frigo,et al.  Cache-oblivious algorithms , 1999, 40th Annual Symposium on Foundations of Computer Science (Cat. No.99CB37039).

[28]  P. Sadayappan,et al.  Communication-Efficient Matrix Multiplication on Hypercubes , 1996, Parallel Comput..

[29]  James Demmel,et al.  Graph Expansion Analysis for Communication Costs of Fast Rectangular Matrix Multiplication , 2012, MedAlg.

[30]  V. Strassen Gaussian elimination is not optimal , 1969 .

[31]  Victor Y. Pan,et al.  Fast rectangular matrix multiplications and improving parallel matrix computations , 1997, PASCO '97.

[32]  James Demmel,et al.  Communication-Optimal Parallel 2.5D Matrix Multiplication and LU Factorization Algorithms , 2011, Euro-Par.

[33]  James Demmel,et al.  Matrix Multiplication on Multidimensional Torus Networks , 2012, VECPAR.

[34]  James Demmel,et al.  Brief announcement: strong scaling of matrix multiplication algorithms and memory-independent communication lower bounds , 2012, SPAA '12.