Communication-optimal Parallel and Sequential QR and LU Factorizations

We present parallel and sequential dense QR factorization algorithms that are both optimal (up to polylogarithmic factors) in the amount of communication they perform and just as stable as Householder QR. Our first algorithm, Tall Skinny QR (TSQR), factors m-by-n matrices in a one-dimensional (1-D) block cyclic row layout and is optimized for m >> n. Our second algorithm, Communication-Avoiding QR (CAQR), factors general rectangular matrices distributed in a two-dimensional (2-D) block cyclic layout, and invokes TSQR to factor each block column.
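
As a rough illustration of the TSQR idea, the sketch below computes only the R factor of a tall, skinny matrix with a binary reduction tree over row blocks. It is a serial NumPy stand-in, not the authors' implementation: the function name `tsqr_r`, the parameter `num_blocks`, the contiguous (rather than block cyclic) row split, and the choice of a binary tree are all assumptions made for the example, and each block is assumed to have at least n rows.

```python
import numpy as np

def tsqr_r(A, num_blocks=4):
    """Return the n-by-n R factor of a tall-skinny A (hypothetical helper).

    Assumes every row block has at least n rows.
    """
    # Split rows into contiguous blocks; this stands in for a 1-D row layout
    # in which each block would live on a different processor.
    blocks = np.array_split(A, num_blocks, axis=0)
    # Leaf step: an independent local QR on each block, keeping only R.
    rs = [np.linalg.qr(b, mode="r") for b in blocks]
    # Reduction step: stack neighbouring R factors and re-factor them,
    # halving the number of factors at each level until one remains.
    while len(rs) > 1:
        next_level = []
        for i in range(0, len(rs), 2):
            stacked = np.vstack(rs[i:i + 2])
            next_level.append(np.linalg.qr(stacked, mode="r"))
        rs = next_level
    return rs[0]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((1000, 8))
    R = tsqr_r(A, num_blocks=4)
    # R should reproduce the Gram matrix of A: R^T R = A^T A.
    print(np.allclose(R.T @ R, A.T @ A))
```

In a distributed setting, each level of the loop would correspond to one round of messages carrying n-by-n triangular factors, which is where the communication savings for m >> n would come from; this sketch does not model that cost or form the Q factor.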
