Stability and Performance of Various Singular Value QR Implementations on Multicore CPU with a GPU

Singular Value QR (SVQR) can orthonormalize a set of dense vectors with minimal communication: one global reduction between the parallel processing units, with BLAS-3 operations performing most of the local computation. As a result, compared to other orthogonalization schemes, SVQR obtains superior performance on many current computers, where communication has become significantly more expensive than arithmetic operations. In this article, we study the stability and performance of several SVQR implementations on multicore CPUs with a GPU. Our focus is on the dense triangular solve, which accounts for half of the total floating-point operations of SVQR. As part of this study, we examine an adaptive mixed-precision variant of SVQR that decides at runtime whether lower-precision arithmetic can be used for the triangular solve without increasing the order of its orthogonality error (though its backward error is significantly greater). If the greater backward error can be tolerated, our performance results on an NVIDIA Kepler GPU show that the mixed-precision SVQR can obtain a speedup of up to 1.36 over the standard SVQR.
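To make this structure concrete, the following NumPy sketch illustrates both the standard SVQR and the adaptive mixed-precision variant described above. It is a minimal illustration under stated assumptions, not the article's implementation: the function names and the threshold `tau` are our own placeholders, the runtime precision criterion in the article is more refined than this simple condition-number test, and the article's GPU implementations use tuned BLAS-3 kernels rather than NumPy.

```python
import numpy as np

def svqr(V):
    """Orthonormalize the columns of V with Singular Value QR (sketch).

    The Gram matrix B = V^T V is the single global reduction; the SVD of
    the small k-by-k matrix B and the dense solve that forms Q are
    BLAS-3-like local computation.  Assumes V has full numerical rank.
    """
    B = V.T @ V                       # one SYRK-style reduction
    U, s, _ = np.linalg.svd(B)        # B is symmetric PSD: B = U diag(s) U^T
    R = np.sqrt(s)[:, None] * U.T     # R = diag(sqrt(s)) U^T, so R^T R = B
    Q = np.linalg.solve(R.T, V.T).T   # Q = V R^{-1}: the dense solve studied here
    return Q, R

def svqr_mixed(V, tau=1.0e4):
    """Adaptive mixed-precision SVQR (illustrative sketch only).

    The singular values of B give an estimate of cond(V) at no extra
    cost; if V is well conditioned, the solve is done in single
    precision.  The placeholder threshold tau is ours, not the paper's.
    """
    B = V.T @ V
    U, s, _ = np.linalg.svd(B)
    R = np.sqrt(s)[:, None] * U.T
    cond_V = np.sqrt(s[0] / s[-1])    # cond(V) = sqrt(cond(B)), s sorted descending
    if cond_V < tau:
        # Lower-precision solve: larger backward error, but the order of
        # the orthogonality error ||I - Q^T Q|| can be preserved.
        Q32 = np.linalg.solve(R.T.astype(np.float32),
                              V.T.astype(np.float32)).T
        Q = Q32.astype(np.float64)
    else:
        Q = np.linalg.solve(R.T, V.T).T
    return Q, R
```

Note that the singular values of the Gram matrix, which SVQR computes anyway, supply the condition estimate essentially for free, so the runtime precision decision adds no communication beyond the one global reduction.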
