Communication-Avoiding QR Decomposition for GPUs

We describe an implementation of the Communication-Avoiding QR (CAQR) factorization that runs entirely on a single graphics processor (GPU). We show that the reduction in memory traffic provided by CAQR allows us to outperform existing parallel GPU implementations of QR for a large class of tall-skinny matrices. Other GPU implementations of QR handle panel factorizations either by sending the work to a general-purpose processor, which incurs data transfer overheads, or by using entirely bandwidth-bound operations. In contrast, our QR is done entirely on the GPU using compute-bound kernels, so performance is good regardless of the width of the matrix. As a result, we outperform CULA, a parallel linear algebra library for GPUs, by up to 17x for tall-skinny matrices, and Intel's Math Kernel Library (MKL) by up to 12x. We also discuss stationary video background subtraction as a motivating application. We apply a recent statistical approach that requires many iterations, each of which computes the singular value decomposition (SVD) of a tall-skinny matrix. Using CAQR as the first step in computing the SVD, we obtain the answer 3x faster than with a traditional bandwidth-bound GPU QR factorization tuned specifically for that matrix size, and 30x faster than with the MKL SVD routine on a multicore CPU.
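The idea of using a tall-skinny QR as the first step of an SVD can be illustrated with a minimal CPU sketch in NumPy. This is not the GPU implementation described above: the function names `tsqr` and `tall_skinny_svd`, the `block_rows` parameter, and the single-level reduction are assumptions made for illustration only. Each row block is factored independently, only the small R factors are combined, and the SVD is then taken of the final n-by-n R rather than of the full matrix.

```python
import numpy as np

def tsqr(A, block_rows=256):
    """One-level tall-skinny QR sketch: returns Q (m x n) and R (n x n)."""
    m, n = A.shape
    local_qs, local_rs = [], []
    # Step 1: factor each row block on its own (reduced QR).
    for start in range(0, m, block_rows):
        q, r = np.linalg.qr(A[start:start + block_rows])
        local_qs.append(q)
        local_rs.append(r)
    # Step 2: stack the small R factors and factor the stack once.
    q2, R = np.linalg.qr(np.vstack(local_rs))
    # Step 3: combine each local Q with its slice of the second-level Q.
    q_blocks, offset = [], 0
    for q, r in zip(local_qs, local_rs):
        k = r.shape[0]
        q_blocks.append(q @ q2[offset:offset + k])
        offset += k
    return np.vstack(q_blocks), R

def tall_skinny_svd(A, block_rows=256):
    """SVD of a tall-skinny A via QR first, then SVD of the small R."""
    Q, R = tsqr(A, block_rows)
    U_small, s, Vt = np.linalg.svd(R)  # R is only n x n
    return Q @ U_small, s, Vt

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((1000, 8))
    U, s, Vt = tall_skinny_svd(A, block_rows=128)
    print(np.allclose((U * s) @ Vt, A))
```

Because A = QR and R = U_small diag(s) Vt, the left singular vectors of A are Q @ U_small. The expensive SVD therefore runs on an n-by-n matrix, while the work on the tall dimension is confined to the QR step.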
