Fast Blocking of Householder Reflectors on Graphics Processors

We revisit an alternative to the compact WY representation for the accumulation (blocking) of Householder reflectors. The alternative exhibits the same numerical stability and is built from efficient Level-3 Basic Linear Algebra Subprograms (BLAS) kernels, in contrast with the Level-2 BLAS used to construct the conventional compact WY representation. For the orthogonal reduction to condensed forms on multicore platforms equipped with a fast graphics processing unit (GPU), or more generally when there is a notable performance gap between the multicore processors and the graphics accelerator, our approach removes the assembly of the accumulation from the critical path of the algorithm. It does so by accelerating this operation with Level-3 BLAS, moving the computation to the GPU, and allowing larger algorithmic block sizes. Our experiments with the alternative orthogonal representation show considerable speed-ups, which can be in the range of 20-40% on recent GPUs, when compared with the codes in MAGMA.
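The contrast between the two accumulation schemes can be illustrated with a small NumPy sketch. It assumes reflectors of the form H_j = I - u_j u_j^T / tau_j with tau_j = u_j^T u_j / 2, and it assumes that the alternative representation is of the form Q = I - U T^{-1} U^T, where the strictly upper triangle of T is taken from U^T U and the diagonal holds the tau_j; the abstract does not spell out the formulation, so this is one plausible reading rather than the authors' code. The function names are hypothetical, and everything runs on the CPU purely to check the algebra.

```python
import numpy as np

def explicit_product(U, tau):
    """Reference: form Q = H_0 H_1 ... H_{k-1} one reflector at a time."""
    m, k = U.shape
    Q = np.eye(m)
    for j in range(k):
        u = U[:, j:j + 1]
        Q = Q @ (np.eye(m) - (u @ u.T) / tau[j])
    return Q

def compact_wy_factor(U, tau):
    """Build S with Q = I - U S U^T column by column.

    Each step uses a matrix-vector product and a triangular
    matrix-vector product, i.e. Level-2 BLAS style operations.
    """
    k = U.shape[1]
    S = np.zeros((k, k))
    for j in range(k):
        t = U[:, :j].T @ U[:, j]              # GEMV-like
        S[:j, j] = -(S[:j, :j] @ t) / tau[j]  # TRMV-like
        S[j, j] = 1.0 / tau[j]
    return S

def ut_factor(U, tau):
    """Build T with Q = I - U T^{-1} U^T (assumed alternative form).

    The bulk of the work is the single matrix-matrix product U^T U,
    i.e. a Level-3 BLAS style operation.
    """
    T = np.triu(U.T @ U, 1)             # strictly upper triangle of U^T U
    T[np.diag_indices_from(T)] = tau    # diagonal holds the tau_j
    return T

def demo(m=12, k=4, seed=0):
    rng = np.random.default_rng(seed)
    # Unit lower-trapezoidal vectors, as produced by a Householder QR panel.
    U = np.tril(rng.standard_normal((m, k)), -1)
    U[np.arange(k), np.arange(k)] = 1.0
    tau = 0.5 * np.sum(U * U, axis=0)

    Q_ref = explicit_product(U, tau)
    Q_wy = np.eye(m) - U @ compact_wy_factor(U, tau) @ U.T
    # A triangular solve (TRSM) would be used in practice; a general
    # solve keeps this sketch short.
    Q_ut = np.eye(m) - U @ np.linalg.solve(ut_factor(U, tau), U.T)
    return Q_ref, Q_wy, Q_ut

if __name__ == "__main__":
    Q_ref, Q_wy, Q_ut = demo()
    print(np.allclose(Q_ref, Q_wy), np.allclose(Q_ref, Q_ut))
```

In this sketch both factors describe the same orthogonal matrix, since S equals T^{-1}. The difference is only in how the factor is assembled: a loop of matrix-vector operations for S versus one matrix-matrix product for T, which is the kind of kernel that maps well to a GPU.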
