Benchmarking the NVIDIA V100 GPU and Tensor Cores

The V100 GPU is the newest server-grade GPU produced by NVIDIA and introduces a number of new hardware and API features. This paper details the results of benchmarking the V100 GPU and demonstrates that it is a significant generational improvement: it increases memory bandwidth and cache bandwidth and reduces latency. A major new addition is the set of Tensor Core units, which have been marketed as deep learning acceleration features that enable the computation of a \(4\times 4\times 4\) half-precision matrix-multiply-accumulate operation in a single clock cycle. This paper confirms that the Tensor Cores offer considerable performance gains for half-precision general matrix multiplication; however, programming them requires fine-grained control of the memory hierarchy that is typically unnecessary for other applications.
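
As a point of reference for the programming model discussed above, the sketch below shows how the Tensor Cores are typically driven from CUDA through the warp-level WMMA API (namespace \texttt{nvcuda::wmma}), which exposes the hardware's per-cycle \(4\times 4\times 4\) operation as larger warp-wide fragments (here \(16\times 16\times 16\)). This is a minimal illustrative example, not code from the paper: the kernel name, matrix layouts, and tiling scheme are assumptions, and the explicit fragment loads and stores hint at the memory-hierarchy control the abstract refers to.

\begin{verbatim}
// Minimal illustrative WMMA kernel: each warp computes one 16x16 tile of
// C = A * B using the Tensor Cores. Layouts and tiling are assumptions.
#include <cuda_fp16.h>
#include <mma.h>

using namespace nvcuda;

__global__ void wmma_gemm_tile(const half *a, const half *b, float *c,
                               int M, int N, int K) {
    // Per-warp fragments that the Tensor Cores operate on:
    // A is row-major, B is column-major, the accumulator is float.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
    wmma::fill_fragment(c_frag, 0.0f);

    // Index of the 16x16 output tile assigned to this warp.
    int warpM = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;
    int warpN = blockIdx.y * blockDim.y + threadIdx.y;

    // Accumulate over the K dimension in steps of 16.
    for (int k = 0; k < K; k += 16) {
        // Explicitly stage the operand tiles into fragments (registers).
        wmma::load_matrix_sync(a_frag, a + warpM * 16 * K + k, K);
        wmma::load_matrix_sync(b_frag, b + warpN * 16 * K + k, K);
        // Warp-synchronous matrix-multiply-accumulate on the Tensor Cores.
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    }

    // Write the accumulated 16x16 tile back to global memory.
    wmma::store_matrix_sync(c + warpM * 16 * N + warpN * 16, c_frag, N,
                            wmma::mem_row_major);
}
\end{verbatim}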