Minimizing Communication in Numerical Linear Algebra

In 1981 Hong and Kung proved a lower bound on the amount of communication (the amount of data moved between a small, fast memory and a large, slow memory) needed to perform dense n-by-n matrix multiplication using the conventional O(n³) algorithm, where the input matrices are too large to fit in the small, fast memory. In 2004 Irony, Toledo, and Tiskin gave a new proof of this result and extended it to the parallel case (where communication means the amount of data moved between processors). In both cases the lower bound may be expressed as Ω(#arithmetic operations / √M), where M is the size of the fast memory (or local memory in the parallel case). Here we generalize these results to a much wider variety of algorithms, including LU factorization, Cholesky factorization, LDLᵀ factorization, QR factorization, the Gram–Schmidt algorithm, and algorithms for eigenvalues and singular values, i.e., essentially all direct methods of linear algebra. The proof works for dense or sparse matrices and for sequential or parallel ...
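
To make the bound concrete, the short Python sketch below (not from the paper; the tiled schedule, the block size b = sqrt(M/3), and the traffic-counting model are illustrative assumptions) compares the words moved by a square-blocked classical matrix multiplication against the asymptotic Ω(#arithmetic operations / √M) lower bound.

import math

def blocked_matmul_traffic(n, M):
    """Model the words moved between slow and fast memory by classical
    O(n^3) matrix multiplication C = A*B with n-by-n operands, blocked
    into b-by-b tiles so that three tiles fit in a fast memory of M words.
    The schedule and counting rules below are illustrative assumptions."""
    b = max(1, math.isqrt(M // 3))   # three b-by-b tiles must fit in fast memory
    nb = math.ceil(n / b)            # number of tiles along each matrix dimension

    # Assumed tiled schedule: each C tile stays resident while nb pairs of
    # A and B tiles are streamed in, then the C tile is written back once.
    words_moved = 2 * b * b * nb**3 + 2 * b * b * nb**2

    flops = 2 * n**3                    # multiply-adds of the classical algorithm
    lower_bound = flops / math.sqrt(M)  # asymptotic Omega(#flops / sqrt(M)); constants omitted
    return words_moved, lower_bound

if __name__ == "__main__":
    for n, M in [(1024, 2**15), (4096, 2**20)]:
        moved, bound = blocked_matmul_traffic(n, M)
        print(f"n={n}, M={M}: words moved ~ {moved:.2e}, lower bound ~ {bound:.2e}")

Under this model the blocked algorithm's traffic stays within a small constant factor of the lower bound, which is the sense in which such blocked (tiled) algorithms are communication-optimal.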

[1]  H. Whitney,et al.  An inequality related to the isoperimetric inequality , 1949 .

[2]  Keshav Pingali,et al.  Automatic Generation of Block-Recursive Codes , 2000, Euro-Par.

[3]  Sivan Toledo,et al.  Communication lower bounds for distributed-memory matrix multiplication , 2004 .

[4]  Raphael Yuster,et al.  Fast sparse matrix multiplication , 2004, TALG.

[5]  Yousef Saad,et al.  Iterative methods for sparse linear systems , 2003 .

[6]  Robert A. van de Geijn,et al.  FLAME: Formal Linear Algebra Methods Environment , 2001, TOMS.

[7]  Inderjit S. Dhillon,et al.  Orthogonal Eigenvectors and Relative Gaps , 2003, SIAM J. Matrix Anal. Appl..

[8]  James Demmel,et al.  Communication-optimal Parallel and Sequential QR and LU Factorizations , 2008, SIAM J. Sci. Comput..

[9]  James Demmel,et al.  Graph expansion and communication costs of fast matrix multiplication: regular submission , 2011, SPAA '11.

[10]  Katherine Yelick,et al.  OSKI: A library of automatically tuned sparse matrix kernels , 2005 .

[11]  James Demmel,et al.  Communication-Optimal Parallel 2.5D Matrix Multiplication and LU Factorization Algorithms , 2011, Euro-Par.

[13]  Erik Elmroth,et al.  Applying recursion to serial and parallel QR factorization leads to better performance , 2000, IBM J. Res. Dev..

[14]  C. Van Loan,et al.  A Storage-Efficient WY Representation for Products of Householder Transformations , 1989 .

[15]  Karen S. Braman,et al.  The Multishift QR Algorithm. Part II: Aggressive Early Deflation , 2001, SIAM J. Matrix Anal. Appl..

[16]  Erik Elmroth,et al.  Recursive Blocked Algorithms and Hybrid Data Structures for Dense Matrix Library Software , 2004, SIAM Review.

[17]  James Demmel,et al.  Communication avoiding Gaussian elimination , 2008, HiPC 2008.

[18]  Matteo Frigo,et al.  Cache-oblivious algorithms , 1999, 40th Annual Symposium on Foundations of Computer Science (FOCS '99).

[19]  Alok Aggarwal,et al.  The input/output complexity of sorting and related problems , 1988, CACM.

[20]  Fred G. Gustavson,et al.  Recursion leads to automatic variable blocking for dense linear-algebra algorithms , 1997, IBM J. Res. Dev..

[21]  James Demmel,et al.  LU Factorization with Panel Rank Revealing Pivoting and Its Communication Avoiding Version , 2012, SIAM J. Matrix Anal. Appl..

[22]  Christian H. Bischof,et al.  The WY representation for products of Householder matrices , 1985, PPSC.

[23]  Erik Elmroth,et al.  New Serial and Parallel Recursive QR Factorization Algorithms for SMP Systems , 1998, PARA.

[24]  Inderjit S. Dhillon,et al.  The design and implementation of the MRRR algorithm , 2006, TOMS.

[25]  J. Bunch,et al.  Some stable methods for calculating inertia and solving symmetric linear systems , 1977 .

[26]  C. Puglisi Modification of the Householder method based on the compact WY representation , 1992 .

[27]  Viktor K. Prasanna,et al.  Optimizing graph algorithms for improved cache performance , 2002, Proceedings 16th International Parallel and Distributed Processing Symposium.

[28]  Vijaya Ramachandran,et al.  Cache-oblivious dynamic programming , 2006, SODA '06.

[29]  R. K. Shyamasundar,et al.  Introduction to algorithms , 1996 .

[30]  Fred G. Gustavson,et al.  A recursive formulation of Cholesky factorization of a matrix in packed storage , 2001, TOMS.

[31]  Alexander Tiskin,et al.  Memory-Efficient Matrix Multiplication in the BSP Model , 1999, Algorithmica.

[32]  Christian H. Bischof,et al.  A framework for symmetric band reduction , 2000, TOMS.

[33]  H. T. Kung,et al.  I/O complexity: The red-blue pebble game , 1981, STOC '81.

[34]  Sivan Toledo Locality of Reference in LU Decomposition with Partial Pivoting , 1997, SIAM J. Matrix Anal. Appl..

[35]  Robert A. van de Geijn,et al.  Parallel out-of-core computation and updating of the QR factorization , 2005, TOMS.

[36]  Julien Langou,et al.  A Class of Parallel Tiled Linear Algebra Algorithms for Multicore Architectures , 2007, Parallel Comput..

[37]  Sartaj Sahni,et al.  Parallel Matrix and Graph Algorithms , 1981, SIAM J. Comput..

[38]  Viktor K. Prasanna,et al.  Optimizing graph algorithms for improved cache performance , 2004, IEEE Transactions on Parallel and Distributed Systems.

[39]  William Gropp,et al.  Hybrid Static/dynamic Scheduling for Already Optimized Dense Matrix Factorization , 2011, 2012 IEEE 26th International Parallel and Distributed Processing Symposium.

[40]  Cleve Ashcraft,et al.  The Fan-Both Family of Column-Based Distributed Cholesky Factorization Algorithms , 1993 .

[41]  Ramesh C. Agarwal,et al.  A three-dimensional approach to parallel matrix multiplication , 1995, IBM J. Res. Dev..

[42]  Jack Dongarra,et al.  ScaLAPACK Users' Guide , 1997 .

[43]  D. Rose,et al.  Complexity Bounds for Regular Finite Difference and Finite Element Grids , 1973 .

[45]  Jack J. Dongarra,et al.  Basic Linear Algebra Subprograms Technical (Blast) Forum Standard (1) , 2002, Int. J. High Perform. Comput. Appl..

[46]  John E. Savage Extending the Hong-Kung Model to Memory Hierarchies , 1995, COCOON.

[47]  James Demmel,et al.  Applied Numerical Linear Algebra , 1997 .

[48]  G. Golub,et al.  Parallel block schemes for large-scale least-squares computations , 1988 .

[49]  Laura Grigori,et al.  Adapting communication-avoiding LU and QR factorizations to multicore architectures , 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS).

[50]  A. George Nested Dissection of a Regular Finite Element Mesh , 1973 .

[51]  Y. Saad,et al.  Communication complexity of the Gaussian elimination algorithm on multiprocessors , 1986 .

[52]  James Demmel,et al.  Brief announcement: Lower bounds on communication for sparse Cholesky factorization of a model problem , 2010, SPAA '10.

[53]  Christian H. Bischof,et al.  Algorithm 807: The SBR Toolbox—software for successive band reduction , 2000, TOMS.

[55]  Gene H. Golub,et al.  Matrix computations , 1983 .

[56]  Karen S. Braman,et al.  The Multishift QR Algorithm. Part I: Maintaining Well-Focused Shifts and Level 3 Performance , 2001, SIAM J. Matrix Anal. Appl..

[57]  James Demmel,et al.  Graph expansion and communication costs of fast matrix multiplication , 2013 .

[58]  Jack Dongarra,et al.  Preface: Basic Linear Algebra Subprograms Technical (Blast) Forum Standard , 2002 .

[59]  Michael A. Bender,et al.  Optimal Sparse Matrix Dense Vector Multiplication in the I/O-Model , 2007, SPAA '07.

[60]  R. Tarjan,et al.  The analysis of a nested dissection algorithm , 1987 .

[61]  Jeremy D. Frens,et al.  QR factorization with Morton-ordered quadtree matrices for memory re-use and parallelism , 2003, PPoPP '03.

[62]  James Demmel,et al.  Minimizing Communication in Linear Algebra , 2009, ArXiv.

[63]  Jack Dongarra,et al.  Basic Linear Algebra Subprograms (BLAS) , 2011, Encyclopedia of Parallel Computing.

[64]  James Demmel,et al.  CALU: A Communication Optimal LU Factorization Algorithm , 2011, SIAM J. Matrix Anal. Appl..

[65]  Dror Irony,et al.  Trading Replication for Communication in Parallel Distributed-Memory Dense Solvers , 2002, Parallel Process. Lett..

[66]  Jack Dongarra,et al.  LAPACK Users' Guide , 1992 .

[67]  James Demmel,et al.  Fast linear algebra is stable , 2006, Numerische Mathematik.

[68]  J. Demmel,et al.  Implementing Communication-Optimal Parallel and Sequential QR Factorizations , 2008, arXiv:0809.2407.

[69]  James Demmel,et al.  Graph Expansion and Communication Costs of Algorithms , 2010 .

[70]  Robert A. van de Geijn,et al.  PLAPACK: Parallel Linear Algebra Package , 1997, PPSC.

[71]  Lynn Elliot Cannon,et al.  A cellular computer to implement the Kalman filter algorithm , 1969 .

[72]  Thomas H. Cormen,et al.  Introduction to algorithms [2nd ed.] , 2001 .

[73]  James Demmel,et al.  Communication-optimal Parallel and Sequential Cholesky Decomposition , 2009, SIAM J. Sci. Comput..