QR decomposition is a computationally intensive linear algebra operation that factors a matrix A into the product of a unitary matrix Q and an upper triangular matrix R. Adaptive systems commonly employ QR decomposition to solve overdetermined least squares problems. The performance of QR decomposition is typically the factor that limits the problem sizes such systems can handle.
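As a minimal illustration of the least squares use case, the sketch below solves min ||Ax - b|| for a small real-valued matrix using Householder reflections in plain Python. It is a sequential reference sketch, not the GPU implementation discussed in this paper. The function name `householder_qr_solve` is hypothetical, and the code assumes A has at least as many rows as columns and full column rank.

```python
import math


def householder_qr_solve(A, b):
    """Least squares solve of A x ~= b via Householder QR.

    Assumes A is a real m x n list of lists with m >= n and full column rank.
    """
    m, n = len(A), len(A[0])
    R = [row[:] for row in A]  # working copy, reduced to upper triangular form
    y = b[:]                   # working copy, becomes Q^T b

    for k in range(n):
        # Norm of the part of column k on and below the diagonal.
        norm_x = math.sqrt(sum(R[i][k] ** 2 for i in range(k, m)))
        if norm_x == 0.0:
            raise ValueError("matrix is rank deficient")

        # Choose the sign of alpha to avoid cancellation in v[0].
        alpha = -math.copysign(norm_x, R[k][k])
        v = [R[i][k] for i in range(k, m)]
        v[0] -= alpha
        vtv = sum(vi * vi for vi in v)

        # Apply H = I - 2 v v^T / (v^T v) to the trailing columns of R.
        for j in range(k, n):
            s = 2.0 * sum(v[i] * R[k + i][j] for i in range(m - k)) / vtv
            for i in range(m - k):
                R[k + i][j] -= s * v[i]

        # Apply the same reflection to the right-hand side.
        s = 2.0 * sum(v[i] * y[k + i] for i in range(m - k)) / vtv
        for i in range(m - k):
            y[k + i] -= s * v[i]

    # Back substitution on the leading n x n upper triangular block of R.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        acc = y[i] - sum(R[i][j] * x[j] for j in range(i + 1, n))
        x[i] = acc / R[i][i]
    return x


# Fit the line y = c0 + c1 * t to four points that lie exactly on y = 1 + 2t.
A = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
b = [1.0, 3.0, 5.0, 7.0]
coeffs = householder_qr_solve(A, b)
```

Because the four sample points lie exactly on a line, `coeffs` should recover the intercept 1 and slope 2 up to rounding error. Q is never formed explicitly here; each reflection is applied directly to the right-hand side, which is sufficient when only the least squares solution is needed.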
Graphics Processing Units (GPUs) are high-performance processors capable of executing hundreds of floating point operations in parallel. As commodity accelerators for 3D graphics, GPUs offer tremendous computational performance at relatively low cost. While GPUs suit applications with abundant inherent parallelism that require only coarse-grained synchronization between processors, methods for using GPUs efficiently to compute QR decomposition remain elusive.
In this paper, we discuss the architectural characteristics of GPUs and explain how a high-performance QR decomposition may be implemented on them. We provide a detailed performance analysis of the resulting implementation for real-valued matrices and offer recommendations to future developers of dense linear algebra procedures for GPUs on achieving high performance. Our implementation sustains 143 GFLOP/s, and we believe this is the fastest announced QR implementation executing entirely on the GPU.