Paper information - Communication-Avoiding QR Decomposition for GPUs

Communication-Avoiding QR Decomposition for GPUs

We describe an implementation of the Communication-Avoiding QR (CAQR) factorization that runs entirely on a single graphics processor (GPU). We show that the reduction in memory traffic provided by CAQR allows us to outperform existing parallel GPU implementations of QR for a large class of tall-skinny matrices. Other GPU implementations of QR handle panel factorizations by either sending the work to a general-purpose processor or using entirely bandwidth-bound operations, incurring data transfer overheads. In contrast, our QR is done entirely on the GPU using compute-bound kernels, meaning performance is good regardless of the width of the matrix. As a result, we outperform CULA, a parallel linear algebra library for GPUs, by up to 17x for tall-skinny matrices, and Intel's Math Kernel Library (MKL) by up to 12x. We also discuss stationary video background subtraction as a motivating application. We apply a recent statistical approach that requires many iterations of computing the singular value decomposition of a tall-skinny matrix. Using CAQR as the first step in computing the singular value decomposition, we obtain the answer 3x faster than with a traditional bandwidth-bound GPU QR factorization tuned specifically for that matrix size, and 30x faster than with the MKL singular value decomposition routine on a multicore CPU.
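To make the abstract's central idea concrete, here is a minimal CPU sketch in NumPy of a tall-skinny QR built from independent block factorizations, followed by an SVD of the small triangular factor. It is an illustration under stated assumptions, not the paper's GPU kernels: the function names `tsqr` and `svd_via_tsqr`, the `num_blocks` parameter, and the single-level reduction (rather than a reduction tree) are all choices made for this example.

```python
import numpy as np

def tsqr(A, num_blocks=4):
    """One-level tall-skinny QR (illustrative sketch, not the paper's code).

    Splits A into row blocks, factors each block independently, then
    factors the stacked R factors. Returns Q (m x n) and R (n x n)
    with A = Q @ R and Q having orthonormal columns.
    """
    m, n = A.shape
    blocks = np.array_split(A, num_blocks, axis=0)
    if min(b.shape[0] for b in blocks) < n:
        raise ValueError("each row block needs at least n rows")

    # Step 1: independent reduced QR of every row block.
    local = [np.linalg.qr(b) for b in blocks]

    # Step 2: stack the small n x n R factors and factor them once more.
    R_stack = np.vstack([r for _, r in local])
    Q2, R = np.linalg.qr(R_stack)

    # Step 3: form the explicit Q by combining the two levels.
    Q2_blocks = np.split(Q2, num_blocks, axis=0)  # each piece is n x n
    Q = np.vstack([q @ q2 for (q, _), q2 in zip(local, Q2_blocks)])
    return Q, R

def svd_via_tsqr(A, num_blocks=4):
    """SVD of a tall-skinny A using QR as the first step (sketch)."""
    Q, R = tsqr(A, num_blocks)
    # R is only n x n, so this SVD is cheap compared with one on A.
    U_r, s, Vt = np.linalg.svd(R)
    return Q @ U_r, s, Vt
```

A possible usage, with an arbitrary matrix size chosen only for the example:

```python
rng = np.random.default_rng(0)
A = rng.standard_normal((100_000, 16))  # tall and skinny
U, s, Vt = svd_via_tsqr(A, num_blocks=8)
print(np.allclose((U * s) @ Vt, A))
```

The point of the structure is that each row block is read once and factored on its own, and only the small R factors are combined afterwards, which is where the reduction in memory traffic comes from.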
