Towards Half-Precision Computation for Complex Matrices: A Case Study for Mixed Precision Solvers on GPUs

The use of low-precision computations is popular in accelerating machine learning and artificial intelligence (AI) applications. Hardware architectures, such as high-end graphics processing units (GPUs), now support native 16-bit floating-point arithmetic (i.e., half precision). While half precision provides a natural 2×/4× speedup over single/double precision, respectively, modern GPUs are also equipped with hardware accelerators that further boost FP16 performance. These accelerators, known as tensor cores (TCs), have a theoretical peak performance that is 8×/16× faster than FP32/FP64, respectively. Such a high level of performance has encouraged researchers to harness the compute power of TCs outside AI applications. This paper presents a mixed-precision dense linear solver (Ax = b) for complex matrices using the GPU's TC units. Unlike similar efforts that accelerate Ax = b in real FP16 arithmetic, this paper focuses on complex FP16 arithmetic. The developed solution uses a "half-complex" precision to accelerate the solution of Ax = b while maintaining complex FP32 accuracy. The proposed solver requires the development of a high-performance mixed-precision matrix multiplication (CGEMM-FP16) that accepts half-complex inputs and uses the TCs' full-precision products and FP32 accumulations for the computation. We discuss two designs and their performance. Just as fast GEMMs power the performance of LAPACK, the mixed-precision CGEMM-FP16 can enable the development of mixed-precision LAPACK algorithms. We illustrate this by integrating both CGEMM-FP16 designs into mixed-precision LU factorizations of complex matrices. Finally, an iterative refinement procedure based on a preconditioned GMRES solver recovers complex FP32 accuracy. Our experiments, conducted on V100 GPUs, show that the mixed-precision solver can be up to 2.5× faster than a full single-precision complex solver.
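
To make the mixed-precision CGEMM-FP16 concrete, the sketch below shows one standard way to build a half-complex GEMM from real tensor-core GEMMs: the classical 4M formulation, which keeps the real and imaginary parts in separate (planar) FP16 arrays and computes C_re = A_re·B_re − A_im·B_im and C_im = A_re·B_im + A_im·B_re via four cublasGemmEx calls with FP16 inputs and FP32 accumulation. This is an illustrative assumption, not the paper's kernel design (the paper develops two custom designs of its own); the function name cgemm_fp16_4m and the planar layout are ours.

```cuda
// Illustrative 4M-style mixed-precision complex GEMM (hypothetical sketch,
// not the paper's kernel). A and B are stored in planar format: separate
// FP16 device arrays for the real and imaginary parts. Each real GEMM uses
// FP16 inputs with FP32 accumulation, which maps onto the tensor cores on
// Volta and newer GPUs (cuBLAS 11+ API shown).
#include <cublas_v2.h>
#include <cuda_fp16.h>

// All pointers are device memory; matrices are m x k, k x n, and m x n,
// column-major, with leading dimensions equal to their row counts.
void cgemm_fp16_4m(cublasHandle_t handle, int m, int n, int k,
                   const __half* A_re, const __half* A_im,
                   const __half* B_re, const __half* B_im,
                   float* C_re, float* C_im)
{
    const float one = 1.0f, minus_one = -1.0f, zero = 0.0f;

    // C_re = A_re * B_re
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                 &one,  A_re, CUDA_R_16F, m, B_re, CUDA_R_16F, k,
                 &zero, C_re, CUDA_R_32F, m,
                 CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT);
    // C_re -= A_im * B_im
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                 &minus_one, A_im, CUDA_R_16F, m, B_im, CUDA_R_16F, k,
                 &one,       C_re, CUDA_R_32F, m,
                 CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT);
    // C_im = A_re * B_im
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                 &one,  A_re, CUDA_R_16F, m, B_im, CUDA_R_16F, k,
                 &zero, C_im, CUDA_R_32F, m,
                 CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT);
    // C_im += A_im * B_re
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                 &one, A_im, CUDA_R_16F, m, B_re, CUDA_R_16F, k,
                 &one, C_im, CUDA_R_32F, m,
                 CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT);
}
```

Planar storage lets each of the four real GEMMs map directly onto the tensor cores; an interleaved (cuComplex-style) layout would instead require extra packing and unpacking of the real and imaginary planes before and after each multiplication.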

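The refinement stage can likewise be outlined at a high level. The sketch below is a hedged reconstruction of the iterative-refinement pattern the abstract describes: factorize once in half-complex precision, then repeatedly form the FP32-complex residual r = b − Ax, solve the correction equation Ad = r with GMRES preconditioned by the low-precision LU factors, and update x. The helpers lu_factorize_half_complex and gmres_solve_precond are hypothetical placeholders, not the paper's API or existing library routines; the cuBLAS calls are real.

```cuda
// Hedged sketch of mixed-precision iterative refinement for complex Ax = b.
// The two helpers below are hypothetical placeholders standing in for the
// paper's routines: an LU factorization whose trailing-matrix updates use
// the CGEMM-FP16 kernel, and an FP32-complex GMRES preconditioned by the
// resulting low-precision factors.
#include <cublas_v2.h>
#include <cuComplex.h>
#include <cuda_runtime.h>

// Hypothetical placeholders (declarations only, not a real API).
void lu_factorize_half_complex(int n, cuComplex* dLU, int ldda, int* dipiv);
void gmres_solve_precond(int n, const cuComplex* dA, int ldda,
                         const int* dipiv, const cuComplex* dr,
                         cuComplex* dd, float tol);

void mp_cgesv_ir(cublasHandle_t handle, int n,
                 cuComplex* dA,  int ldda,   // FP32-complex copy of A
                 cuComplex* dLU, int* dipiv, // storage for the LU factors
                 const cuComplex* db, cuComplex* dx,
                 cuComplex* dr, cuComplex* dd, float tol, int maxiter)
{
    // 1. Factorize once, with the FP16-accelerated GEMM in the update.
    lu_factorize_half_complex(n, dLU, ldda, dipiv);

    // 2. Start from x = 0; the first pass then acts as the initial solve.
    cudaMemset(dx, 0, (size_t)n * sizeof(cuComplex));

    const cuComplex one       = make_cuComplex(1.0f, 0.0f);
    const cuComplex minus_one = make_cuComplex(-1.0f, 0.0f);

    for (int it = 0; it < maxiter; ++it) {
        // r = b - A*x, computed in FP32 complex.
        cublasCcopy(handle, n, db, 1, dr, 1);
        cublasCgemv(handle, CUBLAS_OP_N, n, n, &minus_one, dA, ldda,
                    dx, 1, &one, dr, 1);

        // Correction: solve A*d = r with LU-preconditioned GMRES.
        gmres_solve_precond(n, dA, ldda, dipiv, dr, dd, tol);

        // x = x + d.
        cublasCaxpy(handle, n, &one, dd, 1, dx, 1);

        // (Convergence test on the residual norm omitted for brevity.)
    }
}
```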