Chapter 9: Communication-Avoiding (CA) and Other Innovative Algorithms

In 1981 Hong and Kung proved a lower bound on the amount of communication (the amount of data moved between a small, fast memory and a large, slow memory) needed to perform dense n-by-n matrix multiplication using the conventional O(n^3) algorithm, where the input matrices are too large to fit in the small, fast memory. In 2004 Irony, Toledo and Tiskin gave a new proof of this result and extended it to the parallel case (where communication means the amount of data moved between processors). In both cases the lower bound can be expressed as Ω(#arithmetic operations / √M), where M is the size of the fast memory (or of the local memory, in the parallel case). Here we generalize these results to a much wider variety of algorithms, including LU factorization, Cholesky factorization, LDL^T factorization, QR factorization, the Gram–Schmidt algorithm, and algorithms for eigenvalues and singular values; that is, essentially all direct methods of linear algebra. The proofs work for dense or sparse matrices, and for sequential or parallel algorithms. In addition to lower bounds on the amount of data moved (bandwidth cost), we obtain lower bounds on the number of messages required to move it (latency cost). We extend our lower-bound technique to compositions of linear algebra operations (such as computing powers of a matrix), in order to decide whether it is enough to call a sequence of simpler optimal algorithms (such as matrix multiplication) to minimize communication, or whether we can do better; we give examples of both. We also show how to extend our lower bounds to certain graph-theoretic problems, and we point out recently designed algorithms that attain many of these lower bounds.
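To make the bound concrete, consider the conventional dense algorithm, which performs about 2n^3 arithmetic operations: the bandwidth-cost lower bound is then Ω(n^3/√M) words moved, and, since no message can carry more than M words, the latency-cost lower bound is Ω(n^3/M^(3/2)) messages. Classical blocking attains the bandwidth bound. The C sketch below is a minimal illustration of that idea, not code from any of the works cited here (the function name and interface are ours); choosing the block size b so that three b-by-b blocks fit in fast memory, i.e. b ≈ √(M/3), gives O((n/b)^3 · b^2) = O(n^3/√M) words moved between slow and fast memory.

#include <stddef.h>

/* Blocked n-by-n matrix multiplication, C += A*B, row-major storage.
 * A minimal sketch: with block size b chosen so that three b-by-b
 * blocks fit in fast memory (b ~ sqrt(M/3)), the traffic between
 * slow and fast memory is O(n^3 / b) = O(n^3 / sqrt(M)) words,
 * matching the Hong-Kung lower bound up to a constant factor.
 * n need not be a multiple of b. */
void blocked_matmul(int n, int b, const double *A, const double *B, double *C)
{
    for (int ii = 0; ii < n; ii += b)
        for (int kk = 0; kk < n; kk += b)
            for (int jj = 0; jj < n; jj += b) {
                /* Clip block edges so arbitrary n is handled. */
                int ie = ii + b < n ? ii + b : n;
                int ke = kk + b < n ? kk + b : n;
                int je = jj + b < n ? jj + b : n;
                /* Multiply one pair of b-by-b blocks and accumulate
                 * into a block of C; these three blocks form the
                 * working set that must fit in fast memory. */
                for (int i = ii; i < ie; i++)
                    for (int k = kk; k < ke; k++) {
                        double aik = A[(size_t)i * n + k];
                        for (int j = jj; j < je; j++)
                            C[(size_t)i * n + j] += aik * B[(size_t)k * n + j];
                    }
            }
}

With b = 1 this degenerates into the naive triple loop, which can move on the order of n^3 words when n is large; the gap between the two is exactly the communication that blocking, and the CA algorithms surveyed in this chapter, avoid.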

[1] Sartaj Sahni, et al. Parallel Matrix and Graph Algorithms, 1981, SIAM J. Comput.

[2] Y. Saad, et al. GMRES: a generalized minimal residual algorithm for solving nonsymmetric linear systems, 1986.

[3] H. Walker. Implementation of the GMRES method using Householder transformations, 1988.

[4] Katherine Yelick, et al. OSKI: A library of automatically tuned sparse matrix kernels, 2005.

[5] Dror Irony, et al. Communication lower bounds for distributed-memory matrix multiplication, 2004, J. Parallel Distributed Comput.

[6] D. Hu, et al. A Newton basis GMRES implementation, 1991.

[7] James Demmel. Applied Numerical Linear Algebra, 1997.

[8] Erik Elmroth, et al. Applying recursion to serial and parallel QR factorization leads to better performance, 2000, IBM J. Res. Dev.

[9] James Demmel, et al. CALU: A Communication Optimal LU Factorization Algorithm, 2011, SIAM J. Matrix Anal. Appl.

[10] Jocelyne Erhel. A parallel GMRES version for general sparse matrices, 1995.

[11] James Demmel, et al. Minimizing communication in sparse matrix solvers, 2009, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis.

[12] Jeremy D. Frens, et al. QR factorization with Morton-ordered quadtree matrices for memory re-use and parallelism, 2003, PPoPP '03.

[13] Yousef Saad. Iterative methods for sparse linear systems, 2003.

[14] A. George. Nested Dissection of a Regular Finite Element Mesh, 1973.

[15] D. Rose, et al. Complexity Bounds for Regular Finite Difference and Finite Element Grids, 1973.

[16] Katherine Yelick, et al. Optimizing collective communication on multicores, 2009.

[18] James Demmel, et al. Communication-optimal Parallel and Sequential Cholesky Decomposition, 2009, SIAM J. Sci. Comput.

[19] James Demmel, et al. Avoiding communication in sparse matrix computations, 2008, IEEE International Symposium on Parallel and Distributed Processing.

[20] James Demmel, et al. LU Factorization with Panel Rank Revealing Pivoting and Its Communication Avoiding Version, 2012, SIAM J. Matrix Anal. Appl.

[21] James Demmel, et al. Brief announcement: Lower bounds on communication for sparse Cholesky factorization of a model problem, 2010, SPAA '10.

[22] Robert A. van de Geijn, et al. FLAME: Formal Linear Algebra Methods Environment, 2001, TOMS.

[23] Christian H. Bischof, et al. A framework for symmetric band reduction, 2000, TOMS.

[24] Karen S. Braman, et al. The Multishift QR Algorithm. Part II: Aggressive Early Deflation, 2001, SIAM J. Matrix Anal. Appl.

[25] James Demmel, et al. Exploiting Data Sparsity in Parallel Matrix Powers Computations, 2013, PPAM.

[26] Charles E. Leiserson, et al. Cache-Oblivious Algorithms, 2003, CIAC.

[27] James Demmel, et al. Minimizing Communication in Linear Algebra, 2009, arXiv.

[28] H. T. Kung, et al. I/O complexity: The red-blue pebble game, 1981, STOC '81.

[29] Keshav Pingali, et al. Automatic Generation of Block-Recursive Codes, 2000, Euro-Par.

[30] C. Puglisi. Modification of the Householder method based on the compact WY representation, 1992.

[31] G. Golub, et al. Parallel block schemes for large-scale least-squares computations, 1988.

[32] Robert A. van de Geijn, et al. Parallel out-of-core computation and updating of the QR factorization, 2005, TOMS.

[33] Fred G. Gustavson. Recursion leads to automatic variable blocking for dense linear-algebra algorithms, 1997, IBM J. Res. Dev.

[34] E. de Sturler. A parallel variant of GMRES(m), 1991.

[35] Alok Aggarwal, et al. The input/output complexity of sorting and related problems, 1988, CACM.

[36] James Demmel, et al. Communication-Optimal Parallel 2.5D Matrix Multiplication and LU Factorization Algorithms, 2011, Euro-Par.

[37] Jack Dongarra, et al. LAPACK Users' Guide, 1992.

[38] Samuel Williams, et al. Optimization of sparse matrix-vector multiplication on emerging multicore platforms, 2007, Proceedings of the ACM/IEEE Conference on Supercomputing (SC '07).

[39] Raphael Yuster, et al. Fast sparse matrix multiplication, 2004, TALG.

[40] Ronald B. Morgan. Implicitly Restarted GMRES and Arnoldi Methods for Nonsymmetric Systems of Equations, 2000, SIAM J. Matrix Anal. Appl.

[41] C. Van Loan, et al. A Storage-Efficient $WY$ Representation for Products of Householder Transformations, 1989.

[42] James Demmel, et al. Perfect Strong Scaling Using No Additional Energy, 2013, IEEE 27th International Symposium on Parallel and Distributed Processing.

[43] Dror Irony, et al. Trading Replication for Communication in Parallel Distributed-Memory Dense Solvers, 2002, Parallel Process. Lett.

[44] James Demmel, et al. Communication-Avoiding Parallel Strassen: Implementation and performance, 2012, International Conference for High Performance Computing, Networking, Storage and Analysis.

[45] Sivan Toledo. Locality of Reference in LU Decomposition with Partial Pivoting, 1997.

[46] Richard W. Vuduc, et al. Sparsity: Optimization Framework for Sparse Matrix Kernels, 2004, Int. J. High Perform. Comput. Appl.

[47] Robert H. Halstead, et al. Matrix Computations, 2011, Encyclopedia of Parallel Computing.

[48] Erik Elmroth, et al. Recursive Blocked Algorithms and Hybrid Data Structures for Dense Matrix Library Software, 2004, SIAM Review, Vol. 46, No. 1, pp. 3–45.

[49] David A. Patterson, et al. Direction-optimizing Breadth-First Search, 2012, International Conference for High Performance Computing, Networking, Storage and Analysis.

[50] J. Demmel, et al. Avoiding Communication in Computing Krylov Subspaces, 2007.

[51] Thomas H. Cormen, et al. Introduction to Algorithms, MIT Press.

[52] James Demmel, et al. A Residual Replacement Strategy for Improving the Maximum Attainable Accuracy of s-Step Krylov Subspace Methods, 2014, SIAM J. Matrix Anal. Appl.

[53] Robert A. van de Geijn, et al. PLAPACK: Parallel Linear Algebra Package, 1997, PPSC.

[54] Richard Vuduc. Automatic performance tuning of sparse matrix kernels, 2003.

[55] James Demmel, et al. Fast linear algebra is stable, 2006, Numerische Mathematik.

[56] Christian H. Bischof, et al. The WY representation for products of Householder matrices, 1985, PPSC.

[57] Fred G. Gustavson, et al. A recursive formulation of Cholesky factorization of a matrix in packed storage, 2001, TOMS.

[58] James Demmel, et al. Nonnegative Diagonals and High Performance on Low-Profile Matrices from Householder QR, 2009, SIAM J. Sci. Comput.

[59] Marc Snir, et al. Getting Up to Speed: The Future of Supercomputing, 2004.

[60] H. Whitney, et al. An inequality related to the isoperimetric inequality, 1949.

[61] James Demmel, et al. Avoiding Communication in Two-Sided Krylov Subspace Methods, 2011.

[62] James Demmel, et al. Communication Avoiding Gaussian elimination, 2008, SC: International Conference for High Performance Computing, Networking, Storage and Analysis.

[63] W. Joubert, et al. Parallelizable restarted iterative methods for nonsymmetric linear systems. Part I: Theory, 1992.

[64] Robert A. van de Geijn, et al. Anatomy of high-performance matrix multiplication, 2008, TOMS.

[65] James Demmel, et al. Communication-Optimal Parallel Recursive Rectangular Matrix Multiplication, 2013, IEEE 27th International Symposium on Parallel and Distributed Processing.

[66] Julien Langou, et al. A Class of Parallel Tiled Linear Algebra Algorithms for Multicore Architectures, 2007, Parallel Comput.

[67] James Demmel, et al. Communication avoiding successive band reduction, 2012, PPoPP '12.

[68] Mark Hoemmen, et al. A Communication-Avoiding, Hybrid-Parallel, Rank-Revealing Orthogonalization Method, 2011, IEEE International Parallel & Distributed Processing Symposium.

[69] Vipin Kumar, et al. A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs, 1998, SIAM J. Sci. Comput.

[70] Samuel H. Fuller, et al. The Future of Computing Performance: Game Over or Next Level?, 2014.

[71] Y. Saad, et al. Communication complexity of the Gaussian elimination algorithm on multiprocessors, 1986.

[72] J. Cullum, et al. A block Lanczos algorithm for computing the q algebraically largest eigenvalues and a corresponding eigenspace of large, sparse, real symmetric matrices, 1974, CDC 1974.

[73] Cleve Ashcraft. The fan-both family of column-based distributed Cholesky factorization algorithms, 1993.

[74] Vijaya Ramachandran, et al. Cache-oblivious dynamic programming, 2006, SODA '06.

[75] Sivan Toledo. Quantitative performance modeling of scientific computations and creating locality in numerical algorithms, 1995.

[76] Samuel Williams, et al. Reduced-Bandwidth Multithreaded Algorithms for Sparse Matrix-Vector Multiplication, 2011, IEEE International Parallel & Distributed Processing Symposium.

[77] James Demmel, et al. Communication-Avoiding Krylov Techniques on Graphic Processing Units, 2013, IEEE Transactions on Magnetics.

[78] James Demmel, et al. Avoiding Communication in Successive Band Reduction, 2015, ACM Trans. Parallel Comput.

[79] Michael A. Bender, et al. Optimal Sparse Matrix Dense Vector Multiplication in the I/O-Model, 2007, SPAA '07.

[80] John E. Savage. Extending the Hong-Kung Model to Memory Hierarchies, 1995, COCOON.

[81] Karen S. Braman, et al. The Multishift QR Algorithm. Part I: Maintaining Well-Focused Shifts and Level 3 Performance, 2001, SIAM J. Matrix Anal. Appl.

[82] James Demmel, et al. Implementing a Blocked Aasen's Algorithm with a Dynamic Scheduler on Multicore Architectures, 2013, IEEE 27th International Symposium on Parallel and Distributed Processing.

[83] Christian H. Bischof, et al. Algorithm 807: The SBR Toolbox—software for successive band reduction, 2000, TOMS.

[84] D. Patterson, et al. Searching for a Parent Instead of Fighting Over Children: A Fast Breadth-First Search Implementation for Graph 500, 2011.

[85] R. Tarjan, et al. The analysis of a nested dissection algorithm, 1987.

[86] Mark Hoemmen. Communication-avoiding Krylov subspace methods, 2010.

[87] Erik Elmroth, et al. New Serial and Parallel Recursive QR Factorization Algorithms for SMP Systems, 1998, PARA.

[88] Katherine A. Yelick, et al. A Communication-Optimal N-Body Algorithm for Direct Interactions, 2013, IEEE 27th International Symposium on Parallel and Distributed Processing.

[89] J. Bunch, et al. Some stable methods for calculating inertia and solving symmetric linear systems, 1977.

[90] Viktor K. Prasanna, et al. Optimizing graph algorithms for improved cache performance, 2002, IEEE Transactions on Parallel and Distributed Systems.

[91] Ramesh C. Agarwal, et al. A three-dimensional approach to parallel matrix multiplication, 1995, IBM J. Res. Dev.

[92] James Demmel, et al. Communication-optimal Parallel and Sequential QR and LU Factorizations, 2008, SIAM J. Sci. Comput.

[93] Lynn Elliot Cannon. A cellular computer to implement the Kalman filter algorithm, 1969.

[94] James Demmel, et al. Communication efficient Gaussian elimination with partial pivoting using a shape morphing data layout, 2013, SPAA.

[96] James Demmel, et al. Communication lower bounds and optimal algorithms for programs that reference arrays - Part 1, 2013, arXiv.

[97] Alexander Tiskin. Memory-Efficient Matrix Multiplication in the BSP Model, 1999, Algorithmica.

[98] Eric de Sturler, et al. Recycling Krylov Subspaces for Sequences of Linear Systems, 2006, SIAM J. Sci. Comput.

[99] D. O'Leary. The block conjugate gradient algorithm and related methods, 1980.