Batched QR and SVD Algorithms on GPUs with Applications in Hierarchical Matrix Compression

High-performance GPU-hosted batched QR decomposition kernels are developed and outperform current implementations for small and rectangular matrices. Various GPU-hosted batched singular value decomposition (SVD) kernels are developed and used as building blocks of a batched randomized SVD kernel for numerically low-rank matrix blocks. Batched QR, SVD, and GEMM kernels are used to compress hierarchical matrices entirely on the GPU.

We present high-performance GPU implementations of the QR decomposition and the SVD of a batch of small matrices, with applications in the compression of hierarchical matrices. The one-sided Jacobi algorithm is used, for its simplicity and inherent parallelism, as a building block for the SVD of low-rank blocks using randomized methods. We implement multiple kernels, selected according to the level of the GPU memory hierarchy in which the matrices can reside, and show substantial speedups over streamed cuSOLVER SVDs. The resulting batched routine is a key component of hierarchical matrix compression, opening up opportunities to perform H-matrix arithmetic efficiently on GPUs.
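To make the one-sided Jacobi building block concrete, the following is a minimal CPU sketch in plain Python, not the GPU kernels described above. It orthogonalizes pairs of columns with plane rotations until all columns are mutually orthogonal; the column norms are then the singular values. The function names `one_sided_jacobi_svd` and `batched_svd`, the tolerance, the sweep limit, and the serial pair ordering are all assumptions made for illustration. A GPU version would typically assign each matrix of the batch to its own thread block or warp and use a parallel pair ordering.

```python
import math


def one_sided_jacobi_svd(a, tol=1e-12, max_sweeps=30):
    """One-sided Jacobi SVD of a small dense m x n matrix (list of rows).

    Returns (u_cols, sigma, v_cols) such that
    a[i][j] ~= sum_k u_cols[k][i] * sigma[k] * v_cols[k][j].
    Singular values are returned unsorted. Illustrative sketch only.
    """
    m, n = len(a), len(a[0])
    # cols[j] is column j of the working matrix W = A * V.
    cols = [[float(a[i][j]) for i in range(m)] for j in range(n)]
    # v_cols[j] is column j of V, starting from the identity.
    v_cols = [[1.0 if i == j else 0.0 for i in range(n)] for j in range(n)]

    for _ in range(max_sweeps):
        rotated = False
        for p in range(n - 1):
            for q in range(p + 1, n):
                alpha = sum(x * x for x in cols[p])
                beta = sum(x * x for x in cols[q])
                gamma = sum(x * y for x, y in zip(cols[p], cols[q]))
                # Skip pairs that are already numerically orthogonal.
                if abs(gamma) <= tol * math.sqrt(alpha * beta):
                    continue
                rotated = True
                # Rotation angle that zeroes the inner product of the pair.
                zeta = (beta - alpha) / (2.0 * gamma)
                t = math.copysign(1.0, zeta) / (
                    abs(zeta) + math.sqrt(1.0 + zeta * zeta)
                )
                cs = 1.0 / math.sqrt(1.0 + t * t)
                sn = cs * t
                # Apply the same rotation to the columns of W and of V.
                for vecs in (cols, v_cols):
                    xp, xq = vecs[p], vecs[q]
                    vecs[p] = [cs * x - sn * y for x, y in zip(xp, xq)]
                    vecs[q] = [sn * x + cs * y for x, y in zip(xp, xq)]
        if not rotated:
            break  # a full sweep without rotations means convergence

    # Column norms are the singular values; normalized columns form U.
    sigma = [math.sqrt(sum(x * x for x in col)) for col in cols]
    u_cols = [
        [x / s if s > 0.0 else 0.0 for x in col]
        for col, s in zip(cols, sigma)
    ]
    return u_cols, sigma, v_cols


def batched_svd(batch):
    """Apply the sketch independently to every matrix in a batch.

    The loop is serial here; the matrices are independent, which is what
    makes the batched formulation parallelizable.
    """
    return [one_sided_jacobi_svd(a) for a in batch]


if __name__ == "__main__":
    batch = [[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], [[3.0, 0.0], [0.0, 4.0]]]
    for u_cols, sigma, v_cols in batched_svd(batch):
        print(sigma)
```

Because each rotation touches only two columns, disjoint column pairs can be rotated concurrently, which is the inherent parallelism the abstract refers to.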
