Communication-avoiding parallel and sequential QR factorizations

We present parallel and sequential dense QR factorization algorithms that are optimized to avoid communication. Some of these are novel, and some extend earlier work. Communication includes both messages between processors (in the parallel case) and data movement between slow and fast memory (in either the sequential or parallel case). Our first algorithm, Tall Skinny QR (TSQR), factors m × n matrices in a one-dimensional (1-D) block cyclic row layout, storing the Q factor (if desired) implicitly as a tree of blocks of Householder reflectors. TSQR is optimized for matrices with many more rows than columns (hence the name). In the parallel case, TSQR requires no more than the minimum number of messages, Θ(log P), between P processors. In the sequential case, TSQR transfers 2mn + o(mn) words between slow and fast memory, which is the theoretical lower bound, and performs Θ(mn/W) block reads and writes (as a function of the fast memory size W), which is within a constant factor of the theoretical lower bound. In contrast, the conventional parallel algorithm as implemented in ScaLAPACK requires Θ(n log P) messages, a factor of n more, and the analogous sequential algorithm transfers Θ(mn²) words between slow and fast memory, also a factor of n more. TSQR uses only orthogonal transforms, so it is just as stable as standard Householder QR. Both parallel and sequential performance results show that TSQR outperforms competing methods. Our second algorithm, CAQR (Communication-Avoiding QR), factors general rectangular matrices distributed in a two-dimensional block cyclic layout. It invokes TSQR for each block column factorization, which removes a latency bottleneck in ScaLAPACK's current parallel approach, and both bandwidth and latency bottlenecks in ScaLAPACK's out-of-core QR factorization. CAQR achieves modeled speedups of 2.1× on an IBM POWER5 cluster, 3.0× on a future petascale machine, and 3.8× on the Grid.
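The core idea of TSQR described above — factor independent row blocks, then combine the resulting R factors pairwise up a reduction tree — can be illustrated with a minimal sequential sketch in NumPy. This is an assumption-laden toy, not the paper's implementation: it returns only the final R factor and discards the per-block Householder data that the real algorithm would retain as the implicit tree-structured Q, and the `block_rows` parameter is a name introduced here for illustration.

```python
import numpy as np

def tsqr(A, block_rows=8):
    """Toy sequential TSQR sketch: QR-factor each row block,
    then reduce the stacked R factors pairwise up a binary tree.
    Returns the final n x n R factor only; the implicit Q (the
    tree of Householder blocks) is omitted for brevity."""
    m, n = A.shape
    # Leaf level: independent QR of each row block.
    blocks = [A[i:i + block_rows] for i in range(0, m, block_rows)]
    Rs = [np.linalg.qr(b, mode="r") for b in blocks]
    # Reduction tree: combine R factors pairwise until one remains.
    while len(Rs) > 1:
        nxt = []
        for j in range(0, len(Rs) - 1, 2):
            stacked = np.vstack([Rs[j], Rs[j + 1]])
            nxt.append(np.linalg.qr(stacked, mode="r"))
        if len(Rs) % 2 == 1:  # carry an odd leftover block upward
            nxt.append(Rs[-1])
        Rs = nxt
    return Rs[0]
```

In the parallel setting each leaf lives on its own processor and each tree level is one message round, which is where the Θ(log P) message count comes from; the sequential variant instead streams one block at a time through fast memory. Note that R is unique only up to the signs of its rows, so the result may differ in sign from a direct Householder QR of A.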
