Numerical behavior of NVIDIA tensor cores

We explore the floating-point arithmetic implemented in NVIDIA tensor cores, which are hardware accelerators for mixed-precision matrix multiplication available on the Volta, Turing, and Ampere microarchitectures. Using Volta V100 and Turing T4 graphics cards, we determine what precision is used for the intermediate results, whether subnormal numbers are supported, what rounding mode is used, in which order the operations underlying the matrix multiplication are performed, and whether partial sums are normalized. These aspects are not documented by NVIDIA, and we gain insight by running carefully designed numerical experiments on these hardware units. Knowing the answers to these questions is important if one wishes to: 1) accurately simulate NVIDIA tensor cores on conventional hardware; 2) understand the differences between results produced by code that uses tensor cores and code that uses only IEEE 754-compliant arithmetic operations; and 3) build hardware that computes a matrix-matrix product matching the results of the NVIDIA tensor cores. As part of this work we provide a test suite that can be easily adapted to probe the latest tensor cores available in the NVIDIA Ampere A100, once those graphics cards become widely available. Moreover, we identify a non-monotonicity issue that arises in floating-point multi-operand addition if the intermediate results are not normalized.
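
For readers who want a starting point for experiments of this kind, the sketch below is a minimal, hypothetical probe written against the public CUDA WMMA API; it is not the test suite provided with the paper, and the file name and input values are illustrative assumptions. It performs a single 16x16x16 tensor-core multiply-accumulate whose exact (0,0) entry is 1 + 2^-11: this value is representable in binary32 but rounds to 1 in binary16, so the printed result indicates whether the products are accumulated in half or single precision.

    // probe_wmma.cu -- compile with: nvcc -arch=sm_70 probe_wmma.cu
    #include <cmath>
    #include <cstdio>
    #include <cuda_fp16.h>
    #include <mma.h>

    using namespace nvcuda;

    // Tile dimensions of the 16x16x16 WMMA shape.
    constexpr int M = 16, N = 16, K = 16;

    // One warp computes D = A*B + C with binary16 inputs and a binary32
    // accumulator, the configuration studied in the paper.
    __global__ void wmma_probe(const half *a, const half *b,
                               const float *c, float *d) {
        wmma::fragment<wmma::matrix_a, M, N, K, half, wmma::row_major> a_frag;
        wmma::fragment<wmma::matrix_b, M, N, K, half, wmma::col_major> b_frag;
        wmma::fragment<wmma::accumulator, M, N, K, float> c_frag;

        wmma::load_matrix_sync(a_frag, a, K);
        wmma::load_matrix_sync(b_frag, b, K);
        wmma::load_matrix_sync(c_frag, c, N, wmma::mem_row_major);
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
        wmma::store_matrix_sync(d, c_frag, N, wmma::mem_row_major);
    }

    int main() {
        half *a, *b;
        float *c, *d;
        cudaMallocManaged(&a, M * K * sizeof(half));
        cudaMallocManaged(&b, K * N * sizeof(half));
        cudaMallocManaged(&c, M * N * sizeof(float));
        cudaMallocManaged(&d, M * N * sizeof(float));

        for (int i = 0; i < M * K; ++i) a[i] = __float2half(0.0f);
        for (int i = 0; i < K * N; ++i) b[i] = __float2half(0.0f);
        for (int i = 0; i < M * N; ++i) { c[i] = 0.0f; d[i] = 0.0f; }

        // First row of A is [1, 2^-11, 0, ...] and first column of B is
        // [1, 1, 0, ...], so the exact (0,0) entry of A*B is 1 + 2^-11.
        // That value is representable in binary32 but rounds to 1 in
        // binary16, so d[0] reveals the precision of the accumulator.
        a[0] = __float2half(1.0f);
        a[1] = __float2half(ldexpf(1.0f, -11));
        b[0] = __float2half(1.0f);
        b[1] = __float2half(1.0f);

        wmma_probe<<<1, 32>>>(a, b, c, d);
        cudaDeviceSynchronize();

        printf("d[0] = %.10e, 1 + 2^-11 = %.10e\n",
               d[0], 1.0f + ldexpf(1.0f, -11));

        cudaFree(a); cudaFree(b); cudaFree(c); cudaFree(d);
        return 0;
    }

The same scaffold can be reused for the other questions raised in the abstract (subnormal support, rounding of ties, order of the additions, normalization of partial sums); only the values written into a, b, and c need to change, with each choice of inputs designed so that different plausible hardware behaviors produce distinguishable outputs.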
