Rounding error analysis of mixed precision block Householder QR algorithms.

Although mixed precision arithmetic has recently garnered interest for training dense neural networks, many other applications could also benefit from its speed-ups and lower storage costs when it is applied appropriately. This growing interest in mixed precision computation motivates the need for rounding error analyses that properly capture the behavior of mixed precision arithmetic. We develop mixed precision variants of existing Householder QR algorithms and present error analyses supported by numerical experiments.
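
Since the abstract describes the algorithmic setting without reproducing it, a minimal NumPy sketch may help fix ideas. This is not the paper's algorithm: the function names (`house`, `mixed_precision_qr`), the float16/float64 pairing, and the simulation of low precision by rounding operands before a float64 accumulation are all assumptions chosen for illustration.

```python
import numpy as np

def house(x):
    """Householder vector v (normalized so v[0] = 1) and scalar beta such
    that (I - beta*v*v^T) x is a multiple of e_1 (Golub & Van Loan style)."""
    x = np.asarray(x, dtype=np.float64)
    sigma = x[1:] @ x[1:]
    if sigma == 0.0:
        v = x.copy()
        v[0] = 1.0
        return v, 0.0  # x is already a multiple of e_1; no reflection needed
    mu = np.sqrt(x[0] ** 2 + sigma)
    # Cancellation-safe choice of v[0], then rescale so v[0] = 1.
    v0 = x[0] - mu if x[0] <= 0.0 else -sigma / (x[0] + mu)
    beta = 2.0 * v0 ** 2 / (sigma + v0 ** 2)
    v = x / v0
    v[0] = 1.0
    return v, beta

def mixed_precision_qr(A, low=np.float16):
    """Unblocked Householder QR in which each reflector application rounds
    its operands to `low` precision but accumulates products in float64,
    loosely mimicking a block fused multiply-add with high-precision
    accumulation. Illustrative sketch only; assumes a tall matrix."""
    A = np.array(A, dtype=np.float64)
    m, n = A.shape
    Q = np.eye(m)
    for j in range(min(m, n)):
        v, beta = house(A[j:, j])
        vl = v.astype(low).astype(np.float64)          # store v in low precision
        Al = A[j:, j:].astype(low).astype(np.float64)  # round the trailing block
        w = beta * (vl @ Al)              # low-precision operands,
        A[j:, j:] -= np.outer(vl, w)      # float64 accumulation and update
        # Accumulate Q = H_1 H_2 ... H_n using the same rounded reflectors.
        Q[:, j:] -= np.outer(Q[:, j:] @ vl, beta * vl)
    return Q, np.triu(A)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((60, 30))
    Q, R = mixed_precision_qr(A)
    print("backward error ||A - QR|| / ||A|| :",
          np.linalg.norm(A - Q @ R) / np.linalg.norm(A))
    print("orthogonality  ||Q^T Q - I||      :",
          np.linalg.norm(Q.T @ Q - np.eye(A.shape[0])))
```

Running the sketch and comparing against a full float64 factorization illustrates the two quantities such an error analysis bounds, the backward error and the loss of orthogonality of Q; under these assumptions one should expect both to grow from roughly the float64 unit roundoff toward the float16 unit roundoff as more of the computation is moved to low precision.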
