Mixed-precision block Gram–Schmidt orthogonalization

The mixed-precision Cholesky QR (CholQR) factorization can orthogonalize the columns of a dense matrix with the minimum communication cost. Moreover, its orthogonality error depends only linearly on the condition number of the input matrix. However, when the required higher precision is not supported by the hardware, software-emulated arithmetic is needed, which can significantly increase the computational cost. When a large number of columns must be orthogonalized, this computational overhead can have a dramatic impact on the orthogonalization time, and the mixed-precision CholQR can be much slower than the standard CholQR. In this paper, we examine several block variants of the algorithm that reduce the computational overhead associated with the software-emulated arithmetic while maintaining the same orthogonality error bound as the mixed-precision CholQR. Our numerical and performance results on multicore CPUs with a GPU, as well as on a hybrid CPU/GPU cluster, demonstrate that, compared to the mixed-precision CholQR, such a block variant can obtain speedups of up to 7.1× while maintaining numerical errors of about the same order.
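To make the algorithmic ingredients concrete, the following is a minimal NumPy sketch of CholQR and of a block classical Gram–Schmidt variant that applies CholQR within each block. This is an illustration under simplifying assumptions, not the paper's implementation: float32 serves as the working precision and float64 Gram-matrix accumulation stands in for the software-emulated higher precision (e.g., double-double), and the function names `cholqr` and `block_cholqr` are ours.

```python
import numpy as np


def cholqr(A, gram_dtype=np.float64):
    """Cholesky QR: one Gram-matrix formation, one small Cholesky
    factorization, and one triangular solve.  Accumulating the Gram
    matrix in a higher precision (gram_dtype) mimics the
    mixed-precision CholQR idea."""
    Ah = A.astype(gram_dtype)
    G = Ah.T @ Ah               # Gram matrix G = A^T A in higher precision
    L = np.linalg.cholesky(G)   # G = L L^T, so the R factor is L^T
    # Q = A R^{-1}  <=>  Q^T = L^{-1} A^T, a triangular solve
    Q = np.linalg.solve(L, Ah.T).T
    return Q.astype(A.dtype), L.T.astype(A.dtype)


def block_cholqr(A, block_size=8, gram_dtype=np.float64):
    """Block classical Gram-Schmidt with CholQR inside each block.
    Inter-block projections run in the working precision; only the
    small intra-block Gram matrices use the higher precision."""
    m, n = A.shape
    Q = np.empty_like(A)
    for j in range(0, n, block_size):
        B = A[:, j:j + block_size].copy()
        if j > 0:
            # Project out the already-orthogonalized columns
            # (working precision only).
            B -= Q[:, :j] @ (Q[:, :j].T @ B)
        # Intra-block orthogonalization with (mixed-precision) CholQR.
        Q[:, j:j + block_size], _ = cholqr(B, gram_dtype=gram_dtype)
    return Q


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((4096, 64)).astype(np.float32)
    Q = block_cholqr(A)
    # Orthogonality error ||Q^T Q - I||_2
    print(np.linalg.norm(Q.T @ Q - np.eye(64), 2))
```

In this sketch, only the small block-by-block Gram matrices are accumulated in higher precision while the inter-block projections stay in the working precision, which illustrates how a block variant can avoid most of the overhead of software-emulated arithmetic.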
