Paper information - Mixed-precision orthogonalization scheme and adaptive step size for CA-GMRES on GPUs

Mixed-precision orthogonalization scheme and adaptive step size for CA-GMRES on GPUs

We propose a mixed-precision orthogonalization scheme that takes the input matrix in standard 64-bit floating-point precision but accumulates its intermediate results in doubled precision. When the target hardware does not support the desired higher precision, we use software emulation. Compared with the standard orthogonalization scheme, ours requires about 8.5× more computation but a much smaller increase in communication. Because computation is becoming less expensive relative to communication on new and emerging architectures, the relative cost of our mixed-precision scheme is decreasing. Our case studies with CA-GMRES on a GPU demonstrate that using mixed precision for this small but critical segment of CA-GMRES can improve not only its overall numerical stability but also, in some cases, its performance. We also study an adaptive scheme that dynamically adjusts the step size of the matrix powers kernel. Our experiments on multiple GPUs show that a near-optimal step size can be chosen based on performance measurements from the first restart loop of CA-GMRES.
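To illustrate what accumulating intermediate results in software-emulated doubled precision can look like, the following is a minimal Python sketch of an inner product whose inputs are ordinary 64-bit doubles but whose running sum is carried as a pair of doubles (high part plus error term). It is not the authors' GPU implementation: the function names `two_sum`, `split`, `two_prod`, `dot_dd`, and `dot_naive` are hypothetical, and the sketch assumes the standard error-free transformations (a branch-free two-sum and a Veltkamp-split two-product) as the emulation technique.

```python
def two_sum(a, b):
    """Error-free addition: returns (s, e) with s = fl(a + b) and s + e = a + b exactly."""
    s = a + b
    bb = s - a
    e = (a - (s - bb)) + (b - bb)
    return s, e


def split(a):
    """Veltkamp split of a double into two non-overlapping halves (hi + lo = a)."""
    c = 134217729.0 * a  # 2**27 + 1
    hi = c - (c - a)
    lo = a - hi
    return hi, lo


def two_prod(a, b):
    """Error-free multiplication: returns (p, e) with p = fl(a * b) and p + e = a * b
    (barring overflow and underflow)."""
    p = a * b
    a_hi, a_lo = split(a)
    b_hi, b_lo = split(b)
    e = ((a_hi * b_hi - p) + a_hi * b_lo + a_lo * b_hi) + a_lo * b_lo
    return p, e


def dot_naive(x, y):
    """Plain 64-bit accumulation, for comparison."""
    acc = 0.0
    for a, b in zip(x, y):
        acc += a * b
    return acc


def dot_dd(x, y):
    """Inner product of double vectors with the sum carried as a (hi, lo) pair."""
    hi, lo = 0.0, 0.0
    for a, b in zip(x, y):
        p, p_err = two_prod(a, b)      # exact product as p + p_err
        hi, s_err = two_sum(hi, p)     # exact sum as hi + s_err
        lo += p_err + s_err            # collect the rounding errors
    return two_sum(hi, lo)             # renormalize the pair


# Cancellation example: the exact result is 1.0.
x = [1e16, 1.0, -1e16]
y = [1.0, 1.0, 1.0]
print(dot_naive(x, y))   # 0.0
print(dot_dd(x, y)[0])   # 1.0
```

Each term costs many more floating-point operations than a plain multiply-add, while the amount of data read stays the same and only the small (hi, lo) result grows. This matches the trade-off described above: substantially more computation with only a modest increase in communication.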
