GPU Tensor Cores for Fast Arithmetic Reductions

This article proposes a parallel algorithm for computing the arithmetic reduction of $n$ numbers as a set of matrix-multiply-accumulate (MMA) operations that are executed simultaneously by GPU tensor cores. The analysis, assuming tensors of size $m \times m$, shows that the proposed algorithm has a parallel running time of $T(n) = 5\log_{m^2}{n}$ and a speedup of $S = \frac{4}{5}\log_{2}{m^2}$ over a canonical parallel reduction. Experimental performance results on a Tesla V100 GPU show that the tensor-core-based approach is energy efficient and runs up to $\sim 3.2\times$ and $2\times$ faster than a standard GPU-based reduction and Nvidia's CUB library, respectively, while keeping the numerical error below 1 percent with respect to a double-precision CPU reduction. The chained design of the algorithm allows a flexible configuration of GPU thread-blocks, and the optimal values found through experimentation agree with the theoretical ones. The results obtained in this work show that GPU tensor cores are relevant not only for Deep Learning or Linear Algebra computations, but also for applications that require the acceleration of large summations.
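The core identity the abstract alludes to is that an all-ones matrix multiply collapses an $m \times m$ tile of values to its total sum: $(\mathbf{1} A \mathbf{1})_{ij} = \sum_{k,l} A_{kl}$ for every entry $(i,j)$. A minimal NumPy sketch of a chained reduction built on that identity is shown below; it is a simplified model of the idea, not the authors' CUDA/WMMA implementation, and the tile size `m=4` and function name are illustrative. Each pass shrinks the input by a factor of $m^2$, matching the $\log_{m^2} n$ depth in the running-time analysis.

```python
import numpy as np

def tensor_core_style_reduce(x, m=4):
    """Sum the values in x via repeated m x m matrix-multiply passes,
    mimicking the chained MMA-based tensor-core reduction (model only)."""
    ones = np.ones((m, m))
    vals = np.asarray(x, dtype=np.float64)
    while vals.size > 1:
        pad = (-vals.size) % (m * m)  # zero-pad up to whole m x m tiles
        tiles = np.concatenate([vals, np.zeros(pad)]).reshape(-1, m, m)
        # ones @ A @ ones places sum(A) in every entry of the product,
        # so each tile of m*m values collapses to one partial sum.
        vals = np.array([(ones @ t @ ones)[0, 0] for t in tiles])
    return vals[0]

print(tensor_core_style_reduce(np.arange(1, 101)))  # 5050.0
```

On real hardware each pass would be a warp-level MMA on half-precision fragments rather than a float64 matmul, which is why the paper must also bound the numerical error against a double-precision CPU reduction.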
