A Hierarchically Blocked Jacobi SVD Algorithm for Single and Multiple Graphics Processing Units

We present a hierarchically blocked one-sided Jacobi algorithm for the singular value decomposition (SVD), targeting both single and multiple graphics processing units (GPUs). The blocking structure reflects the levels of the GPU memory hierarchy. The algorithm may outperform MAGMA's dgesvd while retaining high relative accuracy. To this end, we developed a family of parallel pivot strategies for the GPU's shared address space that are also applicable to inter-GPU communication. Unlike common hybrid approaches, our algorithm in a single-GPU setting needs the CPU for control purposes only, while utilizing the GPU's resources to the fullest extent permitted by the hardware. When the problem size requires it, the algorithm scales, in principle, to an arbitrary number of GPU nodes. The scalability is demonstrated by a more than twofold speedup for sufficiently large matrices on a Tesla S2050 system with four GPUs versus a single Fermi card.
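For orientation, the basic one-sided (Hestenes) Jacobi iteration that the abstract builds on can be sketched in a few lines. This is only an illustrative, unblocked, sequential version with a row-cyclic pivot order, written in plain Python; it is not the hierarchically blocked GPU algorithm or any of its parallel pivot strategies. The function name, the tolerance, and the sweep limit are assumptions chosen for the example.

```python
import math

def one_sided_jacobi_singular_values(a, tol=1e-12, max_sweeps=30):
    """Singular values of an m x n matrix `a` (list of rows, m >= n).

    Illustrative sketch only: unblocked, sequential, row-cyclic pivoting.
    `tol` and `max_sweeps` are assumed defaults, not values from the paper.
    """
    m, n = len(a), len(a[0])
    # Store the matrix by columns, since each rotation mixes two columns.
    cols = [[a[i][j] for i in range(m)] for j in range(n)]
    for _ in range(max_sweeps):
        rotated = False
        for p in range(n - 1):
            for q in range(p + 1, n):
                x, y = cols[p], cols[q]
                app = sum(v * v for v in x)             # squared norm of column p
                aqq = sum(v * v for v in y)             # squared norm of column q
                apq = sum(u * v for u, v in zip(x, y))  # inner product of p and q
                # Skip pairs that are already numerically orthogonal.
                if abs(apq) <= tol * math.sqrt(app * aqq):
                    continue
                rotated = True
                # Rotation that orthogonalizes the two columns
                # (smaller-magnitude root of t^2 + 2*zeta*t - 1 = 0).
                zeta = (aqq - app) / (2.0 * apq)
                t = math.copysign(1.0, zeta) / (abs(zeta) + math.sqrt(1.0 + zeta * zeta))
                c = 1.0 / math.sqrt(1.0 + t * t)
                s = c * t
                for i in range(m):
                    xi, yi = x[i], y[i]
                    x[i] = c * xi - s * yi
                    y[i] = s * xi + c * yi
        if not rotated:
            break  # a full sweep without rotations: columns are orthogonal
    # The singular values are the norms of the orthogonalized columns.
    return sorted((math.sqrt(sum(v * v for v in col)) for col in cols), reverse=True)
```

Each rotation touches only two columns, so rotations on disjoint column pairs are independent of one another. That independence is what a parallel pivot strategy exploits when it schedules many pairs at once.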
