Graph expansion and communication costs of fast matrix multiplication: regular submission

We show that the communication cost of an algorithm (also known as its I/O complexity) is closely related to the expansion properties of its computation graph. We demonstrate this for Strassen's and other fast matrix multiplication algorithms, obtaining the first lower bounds on their communication costs. For sequential algorithms, these bounds are attainable and therefore optimal.
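For readers who want a concrete picture of the algorithm the abstract refers to, the following is a minimal sketch of Strassen's recursion in plain Python. It is an illustration only, not the authors' code: the helper names (`add`, `sub`, `split`, `strassen`) are hypothetical, the input is assumed to be square with a power-of-two dimension, and no attempt is made to model memory traffic. The seven recursive products per level are what give the computation graph its recursive structure.

```python
def add(A, B):
    """Entrywise sum of two equally sized matrices (lists of rows)."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def sub(A, B):
    """Entrywise difference of two equally sized matrices."""
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def split(A):
    """Split a matrix of even dimension into four quadrants."""
    h = len(A) // 2
    return (
        [row[:h] for row in A[:h]],  # top-left
        [row[h:] for row in A[:h]],  # top-right
        [row[:h] for row in A[h:]],  # bottom-left
        [row[h:] for row in A[h:]],  # bottom-right
    )


def strassen(A, B):
    """Multiply two n x n matrices, n a power of two (assumption of this sketch)."""
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]

    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)

    # Seven recursive products instead of the eight used by the classical method.
    M1 = strassen(add(A11, A22), add(B11, B22))
    M2 = strassen(add(A21, A22), B11)
    M3 = strassen(A11, sub(B12, B22))
    M4 = strassen(A22, sub(B21, B11))
    M5 = strassen(add(A11, A12), B22)
    M6 = strassen(sub(A21, A11), add(B11, B12))
    M7 = strassen(sub(A12, A22), add(B21, B22))

    # Recombine the products into the four quadrants of the result.
    C11 = add(sub(add(M1, M4), M5), M7)
    C12 = add(M3, M5)
    C21 = add(M2, M4)
    C22 = add(add(sub(M1, M2), M3), M6)

    top = [left + right for left, right in zip(C11, C12)]
    bottom = [left + right for left, right in zip(C21, C22)]
    return top + bottom


if __name__ == "__main__":
    print(strassen([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

A practical implementation would switch to the classical algorithm below some cutoff size and would handle dimensions that are not powers of two; both are omitted here to keep the recursion visible.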
