Avoiding Communication in Dense Linear Algebra

Author(s): Ballard, Grey | Advisor(s): Demmel, James W

Abstract: Dense linear algebra computations are essential to nearly every problem in scientific computing and to countless other fields. Most matrix computations enjoy high computational intensity (i.e., a high ratio of computation to data), so algorithms for these computations have the potential for high efficiency. However, the performance of many linear algebra algorithms is limited by the cost of moving data between processors on a parallel computer or through the memory hierarchy of a single processor, which we refer to generally as communication. Technological trends indicate that algorithmic performance will become even more communication-limited in the future. In this thesis, we consider the fundamental computations of dense linear algebra and address the following question: can we significantly improve the current algorithms for these computations, in terms of the communication they require and their performance in practice?

To answer this question, we analyze algorithms on sequential and parallel architectural models that are simple enough to determine coarse communication costs yet accurate enough to predict the performance of implementations on real hardware. For most of the computations, we prove lower bounds on the communication that any algorithm must perform. If an algorithm exists whose communication costs match the lower bounds (at least asymptotically), we call the algorithm communication optimal. In many cases, the most commonly used algorithms are not communication optimal, and we can develop new algorithms that require less data movement and attain the communication lower bounds. In this thesis, we develop both new communication lower bounds and new algorithms, tightening (and in many cases closing) the gap between the best known lower bound and the best known algorithm (or upper bound).
We consider both sequential and parallel algorithms, and we assess both classical and fast algorithms (e.g., Strassen's matrix multiplication algorithm).

In particular, the central contributions of this thesis are:

- proving new communication lower bounds for nearly all classical direct linear algebra computations (dense or sparse), including the factorizations used to solve linear systems, least squares problems, and eigenvalue and singular value problems;
- proving new communication lower bounds for Strassen's and other fast matrix multiplication algorithms;
- proving new parallel communication lower bounds for classical and fast computations that set limits on an algorithm's ability to perfectly strong scale;
- summarizing the state of the art in communication efficiency of both sequential and parallel algorithms for the computations to which the lower bounds apply;
- developing a new communication-optimal algorithm for computing a symmetric-indefinite factorization (observing speedups of up to 2.8x over alternative shared-memory parallel algorithms);
- developing new, more communication-efficient algorithms for reducing a symmetric band matrix to tridiagonal form via orthogonal similarity transformations (observing speedups of 2-6x over alternative sequential and parallel algorithms); and
- developing a new communication-optimal parallelization of Strassen's matrix multiplication algorithm (observing speedups of up to 2.84x over alternative distributed-memory parallel algorithms).
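As a rough numerical illustration of the two quantities the abstract leans on, the sketch below computes the computational intensity of classical dense matrix multiplication (2n^3 flops against roughly 3n^2 words of data) and the classical sequential bandwidth lower bound of order n^3 / sqrt(M) words moved when fast memory holds M words. The function names and the omission of constant factors are ours, not the thesis's:

```python
from math import sqrt

def intensity(n):
    """Computational intensity (flops per word) of classical n-by-n
    matrix multiply: 2n^3 flops over ~3n^2 words (read A, B; write C).
    Grows linearly in n, which is why large matmuls can hide communication."""
    return (2 * n**3) / (3 * n**2)

def bandwidth_lower_bound(n, M):
    """Order of the classical sequential lower bound on words moved
    between slow and fast memory for n-by-n matmul, with fast memory
    of M words; constant factor omitted."""
    return n**3 / sqrt(M)
```

Note how the bound shrinks as fast memory grows: quadrupling M halves the required data movement, which is the asymptotic gain that communication-optimal "blocked" algorithms actually attain.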

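The fast algorithms discussed above build on Strassen's recurrence, which multiplies n-by-n matrices using seven (rather than eight) recursive half-size products, giving O(n^log2(7)) arithmetic. A minimal pure-Python sketch for power-of-two sizes follows; the helper names are our own, and a practical implementation would switch to the classical algorithm below a cutoff size rather than recursing to 1-by-1 blocks:

```python
def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def sub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def strassen(A, B):
    """Multiply square matrices (lists of lists) whose size is a power of two."""
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]
    h = n // 2
    # Split each operand into four h-by-h quadrants.
    A11 = [r[:h] for r in A[:h]]; A12 = [r[h:] for r in A[:h]]
    A21 = [r[:h] for r in A[h:]]; A22 = [r[h:] for r in A[h:]]
    B11 = [r[:h] for r in B[:h]]; B12 = [r[h:] for r in B[:h]]
    B21 = [r[:h] for r in B[h:]]; B22 = [r[h:] for r in B[h:]]
    # Strassen's seven recursive products (vs. eight for the classical scheme).
    M1 = strassen(add(A11, A22), add(B11, B22))
    M2 = strassen(add(A21, A22), B11)
    M3 = strassen(A11, sub(B12, B22))
    M4 = strassen(A22, sub(B21, B11))
    M5 = strassen(add(A11, A12), B22)
    M6 = strassen(sub(A21, A11), add(B11, B12))
    M7 = strassen(sub(A12, A22), add(B21, B22))
    # Recombine the products into the quadrants of C.
    C11 = add(sub(add(M1, M4), M5), M7)
    C12 = add(M3, M5)
    C21 = add(M2, M4)
    C22 = add(sub(add(M1, M3), M2), M6)
    top = [r1 + r2 for r1, r2 in zip(C11, C12)]
    bot = [r1 + r2 for r1, r2 in zip(C21, C22)]
    return top + bot
```

The communication-optimal parallelization contributed by the thesis schedules exactly this recursion tree across processors so that its data movement matches the Strassen-specific lower bound, which is lower than the classical matmul bound.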
[1]  K. Murata,et al.  A New Method for the Tridiagonalization of the Symmetric Band Matrix , 1975 .

[2]  James Demmel,et al.  Implementing a Blocked Aasen's Algorithm with a Dynamic Scheduler on Multicore Architectures , 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.

[3]  James Demmel,et al.  Brief announcement: strong scaling of matrix multiplication algorithms and memory-independent communication lower bounds , 2012, SPAA '12.

[4]  Michael A. Heroux,et al.  GEMMW: A Portable Level 3 BLAS Winograd Variant of Strassen's Matrix-Matrix Multiply Algorithm , 1994, Journal of Computational Physics.

[5]  John E. Savage Extending the Hong-Kung Model to Memory Hierarchies , 1995, COCOON.

[6]  James Demmel,et al.  Fast linear algebra is stable , 2006, Numerische Mathematik.

[7]  James Demmel,et al.  Avoiding Communication in Successive Band Reduction , 2015, ACM Trans. Parallel Comput..

[8]  James Demmel,et al.  Communication-optimal Parallel and Sequential QR and LU Factorizations , 2008, SIAM J. Sci. Comput..

[9]  James Demmel,et al.  Fast matrix multiplication is stable , 2006, Numerische Mathematik.

[10]  Stephen Warshall,et al.  A Theorem on Boolean Matrices , 1962, JACM.

[11]  James Demmel,et al.  IEEE Standard for Floating-Point Arithmetic , 2008 .

[12]  James Demmel,et al.  LU Factorization with Panel Rank Revealing Pivoting and Its Communication Avoiding Version , 2012, SIAM J. Matrix Anal. Appl..

[13]  Ramesh Subramonian,et al.  LogP: a practical model of parallel computation , 1996, CACM.

[14]  Sivan Toledo Locality of Reference in LU Decomposition with Partial Pivoting , 1997, SIAM J. Matrix Anal. Appl..

[15]  Robert A. van de Geijn,et al.  Parallel out-of-core computation and updating of the QR factorization , 2005, TOMS.

[16]  Christian H. Bischof,et al.  The WY representation for products of householder matrices , 1985, PPSC.

[17]  Alexander Tiskin,et al.  Memory-Efficient Matrix Multiplication in the BSP Model , 1999, Algorithmica.

[18]  G. Golub,et al.  Parallel block schemes for large-scale least-squares computations , 1988 .

[19]  Nicholas J. Higham,et al.  INVERSE PROBLEMS NEWSLETTER , 1991 .

[20]  Nicholas J. Higham,et al.  Stable and Efficient Spectral Divide and Conquer Algorithms for the Symmetric Eigenvalue Decomposition and the SVD , 2013, SIAM J. Sci. Comput..

[21]  Lars Karlsson,et al.  Parallel two-stage reduction to Hessenberg form using dynamic scheduling on shared-memory architectures , 2011, Parallel Comput..

[22]  James Demmel,et al.  Brief announcement: Lower bounds on communication for sparse Cholesky factorization of a model problem , 2010, SPAA '10.

[23]  J. Demmel,et al.  An inverse free parallel spectral divide and conquer algorithm for nonsymmetric eigenproblems , 1997 .

[24]  Robert B. Wilhelmson High-speed computing: scientific applications and algorithm design , 1988 .

[25]  James Demmel,et al.  Communication-Avoiding QR Decomposition for GPUs , 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.

[26]  Christian H. Bischof,et al.  Parallel Bandreduction and Tridiagonalization , 1993, PPSC.

[27]  Jack Dongarra,et al.  Experiments with Strassen's Algorithm: From Sequential to Parallel , 2006 .

[28]  Fred G. Gustavson,et al.  Recursion leads to automatic variable blocking for dense linear-algebra algorithms , 1997, IBM J. Res. Dev..

[29]  Yi Ma,et al.  Robust principal component analysis? , 2009, JACM.

[30]  Alok Aggarwal,et al.  The input/output complexity of sorting and related problems , 1988, CACM.

[31]  Alexander Tiskin Communication-efficient parallel generic pairwise elimination , 2007, Future Gener. Comput. Syst..

[32]  Ran Raz On the Complexity of Matrix Product , 2003, SIAM J. Comput..

[33]  H. T. Kung,et al.  I/O complexity: The red-blue pebble game , 1981, STOC '81.

[34]  Grazia Lotti,et al.  O(n2.7799) Complexity for n*n Approximate Matrix Multiplication , 1979, Inf. Process. Lett..

[35]  Yuefan Deng,et al.  New trends in high performance computing , 2001, Parallel Computing.

[36]  B. S. Garbow,et al.  Matrix Eigensystem Routines — EISPACK Guide , 1974, Lecture Notes in Computer Science.

[37]  Robert A. van de Geijn,et al.  FLAME: Formal Linear Algebra Methods Environment , 2001, TOMS.

[38]  James Hardy Wilkinson,et al.  Reduction of the symmetric eigenproblemAx=λBx and related problems to standard form , 1968 .

[39]  Robert A. van de Geijn,et al.  A High Performance Parallel Strassen Implementation , 1995, Parallel Process. Lett..

[40]  J. Demmel An arithmetic complexity lower bound for computing rational functions, with applications to linear algebra , 2013 .

[41]  Bruno Lang,et al.  A Parallel Algorithm for Reducing Symmetric Banded Matrices to Tridiagonal Form , 1993, SIAM J. Sci. Comput..

[42]  Greg Henry,et al.  Application of a High Performance Parallel Eigensolver to Electronic Structure Calculations , 1998, Proceedings of the IEEE/ACM SC98 Conference.

[43]  Robert A. van de Geijn,et al.  Elemental: A New Framework for Distributed Memory Dense Matrix Computations , 2013, TOMS.

[44]  James Demmel,et al.  Communication Avoiding Rank Revealing QR Factorization with Column Pivoting , 2015, SIAM J. Matrix Anal. Appl..

[45]  Jack J. Dongarra,et al.  Two-Stage Tridiagonal Reduction for Dense Symmetric Matrices Using Tile Algorithms on Multicore Architectures , 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.

[46]  Marc Snir,et al.  GETTING UP TO SPEED THE FUTURE OF SUPERCOMPUTING , 2004 .

[47]  P. Sadayappan,et al.  A tensor product formulation of Strassen's matrix multiplication algorithm with memory reduction , 1993, [1993] Proceedings Seventh International Parallel Processing Symposium.

[48]  Christian H. Bischof,et al.  Algorithm 807: The SBR Toolbox—software for successive band reduction , 2000, TOMS.

[49]  J. Bunch,et al.  Some stable methods for calculating inertia and solving symmetric linear systems , 1977 .

[50]  C. Puglisi Modification of the householder method based on the compact WY representation , 1992 .

[51]  Thomas Auckenthaler,et al.  Highly scalable eigensolvers for petaflop applications , 2012 .

[52]  L. Trefethen,et al.  Average-case stability of Gaussian elimination , 1990 .

[53]  C. Loan,et al.  A Storage-Efficient $WY$ Representation for Products of Householder Transformations , 1989 .

[54]  Robert A. van de Geijn,et al.  SUMMA: Scalable Universal Matrix Multiplication Algorithm , 1995 .

[55]  Dror Irony,et al.  Communication lower bounds for distributed-memory matrix multiplication , 2004, J. Parallel Distributed Comput..

[56]  Shmuel Winograd,et al.  On multiplication of 2 × 2 matrices , 1971 .

[57]  James Demmel,et al.  Communication-optimal parallel algorithm for strassen's matrix multiplication , 2012, SPAA '12.

[58]  Jack J. Dongarra,et al.  Parallel reduction to condensed forms for symmetric eigenvalue problems using aggregated fine-grained and memory-aware kernels , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[59]  Ramesh C. Agarwal,et al.  A three-dimensional approach to parallel matrix multiplication , 1995, IBM J. Res. Dev..

[60]  Guy E. Blelloch,et al.  Provably good multicore cache performance for divide-and-conquer algorithms , 2008, SODA '08.

[61]  R. Tarjan,et al.  The analysis of a nested dissection algorithm , 1987 .

[62]  James Demmel,et al.  Communication-Avoiding Parallel Strassen: Implementation and performance , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.

[63]  Jack J. Dongarra,et al.  Automated empirical optimizations of software and the ATLAS project , 2001, Parallel Comput..

[64]  A. Tiskin Bulk-Synchronous Parallel Gaussian Elimination , 2002 .

[65]  H. Schwarz Tridiagonalization of a symetric band matrix , 1968 .

[66]  Matteo Frigo,et al.  Cache-oblivious algorithms , 1999, 40th Annual Symposium on Foundations of Computer Science (Cat. No.99CB37039).

[67]  Thomas Rauber,et al.  Combining building blocks for parallel multi-level matrix multiplication , 2008, Parallel Comput..

[68]  D. Rose,et al.  Complexity Bounds for Regular Finite Difference and Finite Element Grids , 1973 .

[69]  Alok Aggarwal,et al.  Communication Complexity of PRAMs , 1990, Theor. Comput. Sci..

[70]  James Demmel,et al.  LAPACK Users' Guide, Third Edition , 1999, Software, Environments and Tools.

[71]  Jarle Berntsen,et al.  Communication efficient matrix multiplication on hypercubes , 1989, Parallel Comput..

[72]  James Demmel,et al.  Minimizing communication in sparse matrix solvers , 2009, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis.

[73]  Leslie G. Valiant,et al.  A bridging model for multi-core computing , 2008, J. Comput. Syst. Sci..

[74]  N. Higham Notes on Accuracy and Stability of Algorithms in Numerical Linear Algebra , 1999 .

[75]  V. Strassen Gaussian elimination is not optimal , 1969 .

[76]  G. Miller On the Solution of a System of Linear Equations , 1910 .

[77]  Sraban Kumar Mohanty,et al.  I/O efficient QR and QZ algorithms , 2012, 2012 19th International Conference on High Performance Computing.

[78]  James Demmel,et al.  Communication-Optimal Parallel Recursive Rectangular Matrix Multiplication , 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.

[79]  James Demmel,et al.  Communication efficient gaussian elimination with partial pivoting using a shape morphing data layout , 2013, SPAA.

[80]  James Demmel,et al.  CALU: A Communication Optimal LU Factorization Algorithm , 2011, SIAM J. Matrix Anal. Appl..

[81]  Erik Elmroth,et al.  New Serial and Parallel Recursive QR Factorization Algorithms for SMP Systems , 1998, PARA.

[82]  James Demmel,et al.  Graph Expansion Analysis for Communication Costs of Fast Rectangular Matrix Multiplication , 2012, MedAlg.

[83]  James Demmel,et al.  Communication avoiding successive band reduction , 2012, PPoPP '12.

[84]  James Demmel,et al.  Applied Numerical Linear Algebra , 1997 .

[85]  A. George Nested Dissection of a Regular Finite Element Mesh , 1973 .

[86]  Lukas Krämer,et al.  Parallel solution of partial symmetric eigenvalue problems from electronic structure calculations , 2011, Parallel Comput..

[87]  Julien Langou,et al.  A Class of Parallel Tiled Linear Algebra Algorithms for Multicore Architectures , 2007, Parallel Comput..

[88]  F. V. Zee Restructuring the QR Algorithm for Performance , 2011 .

[89]  Viktor K. Prasanna,et al.  Optimizing graph algorithms for improved cache performance , 2004, Proceedings 16th International Parallel and Distributed Processing Symposium.

[90]  Jack Dongarra,et al.  Computational Science: Ensuring America's Competitiveness , 2005 .

[91]  James Demmel,et al.  Communication optimal parallel multiplication of sparse random matrices , 2013, SPAA.

[92]  Linda Kaufman,et al.  The retraction algorithm for factoring banded symmetric matrices , 2007, Numer. Linear Algebra Appl..

[93]  Leslie G. Valiant,et al.  A bridging model for parallel computation , 1990, CACM.

[94]  Frédéric Suter,et al.  Impact of mixed-parallelism on parallel implementations of the Strassen and Winograd matrix multiplication algorithms: Research Articles , 2004 .

[95]  Shang-Hua Teng,et al.  Smoothed Analysis of the Condition Numbers and Growth Factors of Matrices , 2003, SIAM J. Matrix Anal. Appl..

[96]  Mark Ainsworth,et al.  The graduate student's guide to numerical analysis '98 : lecture notes from the VIII EPSRC Summer School in Numerical Analysis , 1999 .

[97]  Mohammad Zubair,et al.  A unified model for multicore architectures , 2008, IFMT '08.

[98]  Gil Shklarski,et al.  Partitioned Triangular Tridiagonalization , 2011, TOMS.

[99]  J. O. Aasen On the reduction of a symmetric matrix to tridiagonal form , 1971 .

[100]  Christian H. Bischof,et al.  A framework for symmetric band reduction , 2000, TOMS.

[101]  Joseph F. Traub Complexity of Sequential and Parallel Numerical Algorithms , 1973 .

[102]  Samuel Williams,et al.  The Landscape of Parallel Computing Research: A View from Berkeley , 2006 .

[103]  Butler W. Lampson,et al.  Annual Review of Computer Science , 1986 .

[104]  Linda Kaufman Band reduction algorithms revisited , 2000, TOMS.

[105]  Jeremy D. Frens,et al.  QR factorization with Morton-ordered quadtree matrices for memory re-use and parallelism , 2003, PPoPP '03.

[106]  D. Eppstein,et al.  Parallel Algorithmic Techniques for Combinatorial Computation , 1988 .

[107]  Ramesh C. Agarwal,et al.  A high-performance matrix-multiplication algorithm on a distributed-memory parallel computer, using overlapped communication , 1994, IBM J. Res. Dev..

[108]  T. H Axford Annual review of computer science. Volume 4, 1989–1990 , 1991 .

[109]  Daniel Kressner,et al.  Algorithm 953 , 2015 .

[110]  Keshav Pingali,et al.  An experimental comparison of cache-oblivious and cache-conscious programs , 2007, SPAA '07.

[111]  C. H. Bischof,et al.  A framework for symmetric band reduction and tridiagonalization , 1994 .

[112]  James Demmel,et al.  Minimizing Communication in All-Pairs Shortest Paths , 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.

[113]  Sivan Toledo,et al.  THE SNAP-BACK PIVOTING METHOD FOR SYMMETRIC , 2006 .

[114]  Robert A. van de Geijn,et al.  SUMMA: scalable universal matrix multiplication algorithm , 1995, Concurr. Pract. Exp..

[115]  James Demmel,et al.  Graph expansion and communication costs of fast matrix multiplication: regular submission , 2011, SPAA '11.

[116]  Linda Kaufman,et al.  Banded Eigenvalue Solvers on Vector Machines , 1984, TOMS.

[117]  Qingshan Luo,et al.  A scalable parallel Strassen's matrix multiplication algorithm for distributed-memory computers , 1995, SAC '95.

[118]  James Demmel,et al.  Communication-Avoiding Symmetric-Indefinite Factorization , 2014, SIAM J. Matrix Anal. Appl..

[119]  Benjamin Lipshitz,et al.  Communication-Avoiding Parallel Recursive Algorithms for Matrix Multiplication , 2013 .

[120]  Keshav Pingali,et al.  Automatic Generation of Block-Recursive Codes , 2000, Euro-Par.

[121]  James Demmel,et al.  Communication lower bounds and optimal algorithms for programs that reference arrays - Part 1 , 2013, ArXiv.

[122]  James Hardy Wilkinson,et al.  Householder's method for symmetric matrices , 1962 .

[123]  Virginia Vassilevska Williams,et al.  Multiplying matrices faster than coppersmith-winograd , 2012, STOC '12.

[124]  James Demmel,et al.  Communication-Optimal Parallel 2.5D Matrix Multiplication and LU Factorization Algorithms , 2011, Euro-Par.

[125]  David S. Wise Ahnentafel Indexing into Morton-Ordered Arrays, or Matrix Locality for Free , 2000, Euro-Par.

[126]  Timothy A. Davis,et al.  The university of Florida sparse matrix collection , 2011, TOMS.

[127]  Lynn Elliot Cannon,et al.  A cellular computer to implement the kalman filter algorithm , 1969 .

[128]  James Demmel,et al.  Brief announcement: communication bounds for heterogeneous architectures , 2011, SPAA '11.

[129]  James Demmel,et al.  Communication-optimal Parallel and Sequential Cholesky Decomposition , 2009, SIAM J. Sci. Comput..

[130]  H. Whitney,et al.  An inequality related to the isoperimetric inequality , 1949 .

[131]  Xiaobai Sun,et al.  Parallel tridiagonalization through two-step band reduction , 1994, Proceedings of IEEE Scalable High Performance Computing Conference.

[132]  Yousef Saad,et al.  Iterative methods for sparse linear systems , 2003 .

[133]  Lukas Krämer,et al.  Developing algorithms and software for the parallel solution of the symmetric eigenvalue problem , 2011, J. Comput. Sci..

[134]  Katherine A. Yelick,et al.  Communication avoiding and overlapping for numerical linear algebra , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.

[135]  James Demmel,et al.  Performance and Accuracy of LAPACK's Symmetric Tridiagonal Eigensolvers , 2008, SIAM J. Sci. Comput..

[136]  Robert A. van de Geijn,et al.  Collective communication: theory, practice, and experience , 2007, Concurr. Comput. Pract. Exp..

[137]  James Demmel,et al.  Minimizing Communication in Numerical Linear Algebra , 2009, SIAM J. Matrix Anal. Appl..

[138]  H. Rutishauser On jacobi rotation patterns , 1963 .

[139]  Barton P. Miller,et al.  Critical path analysis for the execution of parallel and distributed programs , 1988, [1988] Proceedings. The 8th International Conference on Distributed.

[140]  L. R. Kerr,et al.  On Minimizing the Number of Multiplications Necessary for Matrix Multiplication , 1969 .

[141]  J. Demmel,et al.  Sequential Communication Bounds for Fast Linear Algebra , 2012 .

[142]  Jack Dongarra,et al.  ScaLAPACK Users' Guide , 1987 .

[143]  T. Tao,et al.  Finite bounds for Hölder-Brascamp-Lieb multilinear inequalities , 2005, math/0505691.

[144]  Sivasankaran Rajamanickam,et al.  EFFICIENT ALGORITHMS FOR SPARSE SINGULAR VALUE DECOMPOSITION , 2009 .

[145]  Gianfranco Bilardi,et al.  A Lower Bound Technique for Communication on BSP with Application to the FFT , 2012, Euro-Par.