Minimizing Communication in Linear Algebra

Algorithms have two kinds of costs: arithmetic and communication, by which we mean moving data either between levels of a memory hierarchy (in the sequential case) or between processors over a network (in the parallel case). Communication costs can already exceed arithmetic costs by orders of magnitude, and the gap is growing exponentially over time, so our goal is to design linear algebra algorithms that minimize communication. First, we show how to extend known communication lower bounds for O(n³) dense matrix multiplication to all direct linear algebra, i.e. to solving linear systems, least squares problems, eigenproblems and the SVD, for dense or sparse matrices, and for sequential or parallel machines. We also describe new algorithms that attain these lower bounds; some implementations attain large speedups over conventional algorithms. Second, we show how to minimize communication in Krylov-subspace methods for solving sparse linear systems and eigenproblems, and again demonstrate new algorithms with significant speedups.

Monday, December 13, 2010, 4:30 PM
Building 2, Room 105
Refreshments are available in Building 2, Room 290 (Math Common Room) between 3:30 and 4:30 PM.
Applied Math Colloquium: http://www-math.mit.edu/amc/fall10
Mathematics Department: http://www-math.mit.edu
To sign up for Applied Mathematics Colloquium announcements, please contact avisha@math.mit.edu.
Massachusetts Institute of Technology, Department of Mathematics, Cambridge, MA 02139
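For reference, the lower bounds the abstract alludes to are well established: Hong and Kung proved the sequential bound for classical matrix multiplication, and Ballard, Demmel, Holtz, and Schwartz extended it to the direct linear algebra listed above. For a machine whose fast memory (or per-processor memory, in the parallel case) holds M words, and a computation performing G flops of classical three-nested-loop linear algebra, the bounds take the following form (a summary of the published results; G denotes the flop count):

```latex
% Communication lower bounds for classical O(n^3)-style linear algebra
% on a machine whose fast (or per-processor) memory holds M words:
\[
  \#\text{words moved} \;=\; \Omega\!\left(\frac{G}{\sqrt{M}}\right),
  \qquad
  \#\text{messages} \;=\; \Omega\!\left(\frac{G}{M^{3/2}}\right).
\]
% For dense n-by-n matrix multiplication, G = n^3, giving the familiar
% bandwidth bound \Omega(n^3/\sqrt{M}).
```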

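A blocked (tiled) multiply is the standard way to attain the bandwidth bound above in the sequential case. The sketch below is minimal and illustrative, assuming square NumPy matrices and a fast memory of M words; the name blocked_matmul and the block-size rule 3b² ≤ M are ours, not the speaker's implementation:

```python
import numpy as np

def blocked_matmul(A, B, M):
    """Classical O(n^3) multiply, blocked so that three b-by-b tiles
    fit in a fast memory of M words (3*b*b <= M). Each tile, once
    loaded, is reused b times, so the total traffic is
    O(n^3 / sqrt(M)) words, matching the lower bound up to a constant.
    """
    n = A.shape[0]
    b = max(1, int((M / 3) ** 0.5))   # largest b with 3*b^2 <= M
    C = np.zeros((n, n))
    for i in range(0, n, b):
        for j in range(0, n, b):
            for k in range(0, n, b):
                # each tile is read from "slow memory" once per use;
                # NumPy slicing clips automatically at ragged edges
                C[i:i+b, j:j+b] += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]
    return C
```

The same reuse argument drives the communication-optimal factorizations mentioned in the abstract: the tile size is chosen from the memory size, not from the matrix size, so the working set stays resident in fast memory.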
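The abstract does not spell out the Krylov-subspace algorithms, but the kernel they reorganize is easy to state: a conventional implementation of s Krylov steps performs s sparse matrix-vector products, each with its own round of communication, whereas the communication-avoiding "matrix powers kernel" produces the same s+1 basis vectors after a single communication phase. The sketch below shows only the conventional version, to fix notation; krylov_basis is a hypothetical name, and A may be any operator supporting @ (e.g. a scipy.sparse matrix):

```python
import numpy as np

def krylov_basis(A, x, s):
    """Monomial Krylov basis [x, A x, A^2 x, ..., A^s x].

    Each loop iteration below is one SpMV and hence, conventionally,
    one round of communication. Communication-avoiding variants
    precompute the data dependencies (an s-level "ghost zone") and
    obtain all s+1 vectors after one communication phase.
    """
    V = [np.asarray(x, dtype=float)]
    for _ in range(s):
        V.append(A @ V[-1])        # s separate communication rounds here
    return np.column_stack(V)      # n-by-(s+1) basis matrix
```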