Paper information - Tall-and-skinny QR factorization with approximate Householder reflectors on graphics processors

We present a novel method for the QR factorization of large tall-and-skinny matrices that introduces an approximation technique for computing the Householder vectors. On a hybrid platform equipped with a graphics processor, this approach is very competitive: it gains a performance advantage over the conventional factorization by reducing the volume of data transferred between the graphics accelerator and the main memory of the host. Our experiments show that, for tall-and-skinny matrices, the new approach outperforms the code in MAGMA by a large margin, and that it remains very competitive for square matrices when memory transfers and CPU computations are the bottleneck of the Householder QR factorization.

Andrés Tomás | Enrique S. Quintana-Ortí

[1] Yusaku Yamamoto, et al. Shifted Cholesky QR for Computing the QR Factorization of Ill-Conditioned Matrices, 2018, SIAM J. Sci. Comput.

[2] Yusaku Yamamoto, et al. CholeskyQR2: A Simple and Communication-Avoiding Algorithm for Computing a Tall-Skinny QR Factorization on a Large-Scale Parallel System, 2014, 2014 5th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems.

[3] James Demmel, et al. Communication-optimal Parallel and Sequential QR and LU Factorizations, 2008, SIAM J. Sci. Comput.

[4] Robert A. van de Geijn, et al. Parallel out-of-core computation and updating of the QR factorization, 2005, TOMS.

[5] Yusaku Yamamoto, et al. Roundoff error analysis of the Cholesky QR2 algorithm, 2015.

[6] Stanimire Tomov, et al. Analysis and Design Techniques towards High-Performance and Energy-Efficient Dense Linear Solvers on GPUs, 2018, IEEE Transactions on Parallel and Distributed Systems.

[7] James Demmel, et al. LU, QR and Cholesky Factorizations using Vector Capabilities of GPUs, 2008.

[8] P. Strazdins. A comparison of lookahead and algorithmic blocking techniques for parallel matrix factorization, 1998.

[9] Kesheng Wu, et al. A Block Orthogonalization Procedure with Constant Synchronization Requirements, 2000, SIAM J. Sci. Comput.

[10] James Demmel, et al. Reconstructing Householder Vectors from Tall-Skinny QR, 2014, 2014 IEEE 28th International Parallel and Distributed Processing Symposium.

[11] C. Puglisi. Modification of the Householder method based on the compact WY representation, 1992.