A High Performance QDWH-SVD Solver Using Hardware Accelerators

This article describes a new high-performance implementation of the QR-based Dynamically Weighted Halley Singular Value Decomposition (QDWH-SVD) solver on multicore architectures enhanced with multiple GPUs. The standard QDWH-SVD algorithm was introduced by Nakatsukasa and Higham (SIAM SISC, 2013) and combines three successive computational stages: (1) computing the polar decomposition of the original matrix using the QDWH algorithm, (2) computing the symmetric eigendecomposition of the resulting symmetric polar factor to obtain the singular values and the right singular vectors, and (3) a matrix-matrix multiplication to obtain the associated left singular vectors. A comprehensive test suite highlights the numerical robustness of the QDWH-SVD solver. Although QDWH-SVD performs up to twice as many flops as the standard SVD algorithm when computing all singular vectors, our new implementation on a single GPU achieves up to a 4× speedup for asymptotic matrix sizes compared with the equivalent routines from existing state-of-the-art open-source and commercial libraries. When only singular values are needed, QDWH-SVD is penalized by performing an order of magnitude more flops; even so, the singular-value-only implementation on a single GPU can still run up to 18% faster than the best existing equivalent routines.
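The three stages can be illustrated with a small CPU-only NumPy sketch. It is not the GPU implementation described in the article. The function names `qdwh_polar` and `qdwh_svd` are hypothetical, and the sketch assumes a real, square, nonsingular matrix. The scaling factor and the lower bound on the smallest singular value come from simple Frobenius-norm bounds (using an explicit inverse), a simplification chosen for brevity rather than the estimators a production solver would use.

```python
import numpy as np

def qdwh_polar(A, max_iter=20, tol=1e-10):
    """Orthogonal polar factor of a square nonsingular A (illustrative sketch)."""
    n = A.shape[0]
    X = A / np.linalg.norm(A, "fro")  # Frobenius norm bounds the 2-norm from above
    # Lower bound on sigma_min(X): 1/||X^-1||_F <= 1/||X^-1||_2 (assumed simplification).
    l = 1.0 / np.linalg.norm(np.linalg.inv(X), "fro")
    I = np.eye(n)
    for _ in range(max_iter):
        l = min(l, 1.0)  # guard against rounding pushing l above 1
        l2 = l * l
        # Dynamically chosen Halley weights a, b, c.
        d = np.cbrt(4.0 * (1.0 - l2) / (l2 * l2))
        a = np.sqrt(1.0 + d) + 0.5 * np.sqrt(
            8.0 - 4.0 * d + 8.0 * (2.0 - l2) / (l2 * np.sqrt(1.0 + d))
        )
        b = (a - 1.0) ** 2 / 4.0
        c = a + b - 1.0
        # QR-based, inverse-free update: QR of the stacked matrix [sqrt(c) X; I].
        Q, _ = np.linalg.qr(np.vstack([np.sqrt(c) * X, I]))
        Q1, Q2 = Q[:n], Q[n:]
        X_new = (b / c) * X + ((a - b / c) / np.sqrt(c)) * (Q1 @ Q2.T)
        l = l * (a + b * l2) / (1.0 + c * l2)
        converged = np.linalg.norm(X_new - X, "fro") <= tol * np.linalg.norm(X_new, "fro")
        X = X_new
        if converged:
            break
    return X

def qdwh_svd(A):
    """SVD A = U diag(s) V^T assembled from the three stages."""
    Up = qdwh_polar(A)                # stage 1: polar decomposition A = Up H
    H = Up.T @ A
    H = 0.5 * (H + H.T)               # symmetrize away rounding noise
    w, V = np.linalg.eigh(H)          # stage 2: H = V diag(w) V^T, ascending order
    order = np.argsort(w)[::-1]       # singular values in descending order
    s, V = w[order], V[:, order]
    U = Up @ V                        # stage 3: left singular vectors
    return U, s, V

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
U, s, V = qdwh_svd(A)
print(np.linalg.norm(U @ np.diag(s) @ V.T - A))  # reconstruction error
```

Stages 1 and 3 consist of QR factorizations and matrix-matrix products, which is the kind of work that maps well onto GPUs.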
