Mixed-precision block Gram–Schmidt orthogonalization

The mixed-precision Cholesky QR (CholQR) factorization can orthogonalize the columns of a dense matrix with the minimum communication cost. Moreover, its orthogonality error depends only linearly on the condition number of the input matrix. However, when the required higher precision is not supported by the hardware, software-emulated arithmetic is needed, which can significantly increase the computational cost. When a large number of columns must be orthogonalized, this computational overhead can have a dramatic impact on the orthogonalization time, and the mixed-precision CholQR can be much slower than the standard CholQR. In this paper, we examine several block variants of the algorithm that reduce the computational overhead associated with the software-emulated arithmetic while maintaining the same orthogonality error bound as the mixed-precision CholQR. Our numerical and performance results on multicore CPUs with a GPU, as well as on a hybrid CPU/GPU cluster, demonstrate that, compared to the mixed-precision CholQR, such a block variant can obtain speedups of up to 7.1× while maintaining numerical errors of about the same order.
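To make the algorithmic ingredients concrete, the following is a minimal NumPy sketch of CholQR and of a block classical Gram–Schmidt variant that applies CholQR within each block. This is an illustration under simplifying assumptions, not the paper's implementation: float32 serves as the working precision and float64 Gram-matrix accumulation stands in for the software-emulated higher precision (e.g., double-double), and the function names `cholqr` and `block_cholqr` are ours.

```python
import numpy as np


def cholqr(A, gram_dtype=np.float64):
    """Cholesky QR: one Gram-matrix formation, one small Cholesky
    factorization, and one triangular solve.  Accumulating the Gram
    matrix in a higher precision (gram_dtype) mimics the
    mixed-precision CholQR idea."""
    Ah = A.astype(gram_dtype)
    G = Ah.T @ Ah               # Gram matrix G = A^T A in higher precision
    L = np.linalg.cholesky(G)   # G = L L^T, so the R factor is L^T
    # Q = A R^{-1}  <=>  Q^T = L^{-1} A^T, a triangular solve
    Q = np.linalg.solve(L, Ah.T).T
    return Q.astype(A.dtype), L.T.astype(A.dtype)


def block_cholqr(A, block_size=8, gram_dtype=np.float64):
    """Block classical Gram-Schmidt with CholQR inside each block.
    Inter-block projections run in the working precision; only the
    small intra-block Gram matrices use the higher precision."""
    m, n = A.shape
    Q = np.empty_like(A)
    for j in range(0, n, block_size):
        B = A[:, j:j + block_size].copy()
        if j > 0:
            # Project out the already-orthogonalized columns
            # (working precision only).
            B -= Q[:, :j] @ (Q[:, :j].T @ B)
        # Intra-block orthogonalization with (mixed-precision) CholQR.
        Q[:, j:j + block_size], _ = cholqr(B, gram_dtype=gram_dtype)
    return Q


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((4096, 64)).astype(np.float32)
    Q = block_cholqr(A)
    # Orthogonality error ||Q^T Q - I||_2
    print(np.linalg.norm(Q.T @ Q - np.eye(64), 2))
```

In this sketch, only the small block-by-block Gram matrices are accumulated in higher precision while the inter-block projections stay in the working precision, which illustrates how a block variant can avoid most of the overhead of software-emulated arithmetic.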
