Reconstructing Householder Vectors from Tall-Skinny QR

The Tall-Skinny QR (TSQR) algorithm is more communication-efficient than the standard Householder algorithm for QR decomposition of matrices with many more rows than columns. However, TSQR produces a different representation of the orthogonal factor, so supporting it requires additional software development. Further, implicitly applying the orthogonal factor to the trailing matrix when factoring a square matrix is more complicated and costly than with the Householder representation. We show how to perform TSQR and then reconstruct the Householder vector representation with the same asymptotic communication efficiency and little extra computational cost. We demonstrate the high performance and numerical stability of this algorithm both theoretically and empirically. The new Householder reconstruction algorithm allows us to design more efficient parallel QR algorithms, with significantly lower latency cost than Householder QR and lower bandwidth and latency costs than the Communication-Avoiding QR (CAQR) algorithm. As a result, our final parallel QR algorithm outperforms the ScaLAPACK and Elemental implementations of Householder QR, as well as our own implementation of CAQR, on the Hopper Cray XE6 system at NERSC. We also provide algorithmic improvements to the ScaLAPACK and CAQR algorithms.
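To make the idea of "TSQR followed by Householder reconstruction" concrete, the following is a minimal sequential NumPy sketch, not the parallel implementation evaluated in the paper. It assumes a one-level TSQR that forms Q explicitly, and it assumes the reconstruction can be done with an unpivoted LU factorization of Q minus a diagonal sign matrix S, giving a compact WY form I - Y T Y^T. The abstract does not spell out that construction, and the function names `tsqr` and `reconstruct_householder` are hypothetical.

```python
import numpy as np

def tsqr(A, num_blocks=4):
    """One-level TSQR: QR of each row block, then QR of the stacked R factors."""
    blocks = np.array_split(A, num_blocks, axis=0)
    local = [np.linalg.qr(B) for B in blocks]        # reduced QR per block
    Q2, R = np.linalg.qr(np.vstack([Ri for _, Ri in local]))
    parts, row = [], 0
    for Qi, Ri in local:                             # assemble the explicit Q
        k = Ri.shape[0]
        parts.append(Qi @ Q2[row:row + k, :])
        row += k
    return np.vstack(parts), R

def reconstruct_householder(Q):
    """Return Y, T, s such that (I - Y T Y^T)[:, :n] equals Q @ diag(s).

    Assumed approach: LU without pivoting of Q - S, where each sign s[i]
    is chosen opposite to the current diagonal entry so the pivot has
    magnitude at least 1.
    """
    m, n = Q.shape
    W = Q.astype(float).copy()
    s = np.empty(n)
    for i in range(n):
        s[i] = -1.0 if W[i, i] >= 0 else 1.0
        W[i, i] -= s[i]                              # shifted pivot
        W[i + 1:, i] /= W[i, i]                      # column of L (= Y)
        W[i + 1:, i + 1:] -= np.outer(W[i + 1:, i], W[i, i + 1:])
    Y = np.tril(W, -1)
    Y[np.arange(n), np.arange(n)] = 1.0              # unit lower trapezoidal
    U = np.triu(W[:n, :])
    Y1 = Y[:n, :]
    # Solve T @ Y1.T = -(U @ diag(s)) for the upper triangular T.
    T = np.linalg.solve(Y1, -(U * s).T).T
    return Y, T, s

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 5))
Q, R = tsqr(A)
Y, T, s = reconstruct_householder(Q)
H = np.eye(A.shape[0]) - Y @ T @ Y.T                 # full orthogonal factor
R_h = s[:, None] * R                                 # R with row signs flipped
print(np.allclose(H[:, :5] @ R_h, A))
```

In this sketch, Y and T play the role of the Householder vectors and triangular factor that a conventional Householder QR would produce, so existing routines that apply I - Y T Y^T to a trailing matrix could in principle be reused unchanged.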
