Tall-and-skinny QR factorization with approximate Householder reflectors on graphics processors

We present a novel method for the QR factorization of large tall-and-skinny matrices that introduces an approximation technique for computing the Householder vectors. On a hybrid platform equipped with a graphics processor, this approach is very competitive: its performance advantage over the conventional factorization comes from the reduced volume of data transfers between the graphics accelerator and the main memory of the host. Our experiments show that, for tall-and-skinny matrices, the new approach outperforms the code in MAGMA by a large margin, and that it remains very competitive for square matrices when memory transfers and CPU computations are the bottleneck of the Householder QR factorization.
