NVIDIA Tensor Core Programmability, Performance & Precision

The NVIDIA Volta GPU microarchitecture introduces a specialized unit, called the Tensor Core, that performs one matrix multiply-and-accumulate operation on 4x4 matrices per clock cycle. The NVIDIA Tesla V100 accelerator, featuring the Volta microarchitecture, provides 640 Tensor Cores with a theoretical peak performance of 125 Tflops/s in mixed precision. In this paper, we investigate current approaches to programming NVIDIA Tensor Cores, their performance, and the precision loss due to computation in mixed precision. Currently, NVIDIA provides three different ways of programming matrix multiply-and-accumulate on Tensor Cores: the CUDA Warp Matrix Multiply Accumulate (WMMA) API; CUTLASS, a templated library built on WMMA; and cuBLAS GEMM. After experimenting with the different approaches, we found that NVIDIA Tensor Cores can deliver up to 83 Tflops/s in mixed precision on a Tesla V100 GPU, seven and three times the performance in single and half precision, respectively. A WMMA implementation of batched GEMM reaches a performance of 4 Tflops/s. While the precision loss due to matrix multiplication with half-precision input might be critical in many HPC applications, it can be considerably reduced at the cost of additional computation. Our results indicate that HPC applications using matrix multiplications can strongly benefit from NVIDIA Tensor Cores.
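
To make the WMMA approach mentioned above concrete, the sketch below shows how a single warp can compute one 16x16 tile of D = A*B + C on the Tensor Cores, with half-precision inputs and a single-precision accumulator. This is a minimal illustration, not the paper's implementation: it assumes packed column-major matrices whose dimensions are multiples of 16, and it omits the launch configuration, error handling, and the alpha/beta scaling of a full GEMM.

// Minimal CUDA WMMA sketch: one warp per 16x16 output tile of D = A*B + C.
// Assumes column-major, packed A (MxK), B (KxN), C (MxN) with M, N, K
// multiples of 16, and a block shape whose x-dimension is a multiple of 32.
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

constexpr int WMMA_M = 16, WMMA_N = 16, WMMA_K = 16;

__global__ void wmma_gemm(const half *a, const half *b, float *c,
                          int M, int N, int K) {
    // Leading dimensions for packed column-major storage.
    int lda = M, ldb = K, ldc = M;

    // Each warp owns one 16x16 output tile (warpM, warpN).
    int warpM = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;
    int warpN = blockIdx.y * blockDim.y + threadIdx.y;

    // Fragments: half-precision operands, single-precision accumulator.
    wmma::fragment<wmma::matrix_a, WMMA_M, WMMA_N, WMMA_K, half, wmma::col_major> a_frag;
    wmma::fragment<wmma::matrix_b, WMMA_M, WMMA_N, WMMA_K, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, WMMA_M, WMMA_N, WMMA_K, float> acc_frag;
    wmma::fill_fragment(acc_frag, 0.0f);

    // Sweep the K dimension 16 elements at a time, accumulating in FP32.
    for (int k = 0; k < K; k += WMMA_K) {
        int aRow = warpM * WMMA_M, aCol = k;
        int bRow = k,              bCol = warpN * WMMA_N;
        if (aRow < M && aCol < K && bRow < K && bCol < N) {
            wmma::load_matrix_sync(a_frag, a + aRow + aCol * lda, lda);
            wmma::load_matrix_sync(b_frag, b + bRow + bCol * ldb, ldb);
            wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);
        }
    }

    // Write the accumulated 16x16 tile back to C.
    int cRow = warpM * WMMA_M, cCol = warpN * WMMA_N;
    if (cRow < M && cCol < N) {
        wmma::store_matrix_sync(c + cRow + cCol * ldc, acc_frag, ldc,
                                wmma::mem_col_major);
    }
}

A launch with, for example, 128x4 thread blocks maps four warps per block onto tiles along M and four tiles along N. The same mixed-precision product can be obtained without explicit kernels through the cuBLAS route discussed in the paper, e.g. by calling cublasGemmEx with half-precision inputs, a single-precision compute type, and Tensor Core math enabled on the handle.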
