Analyzing GPU Tensor Core Potential for Fast Reductions

The Nvidia GPU architecture has introduced new computing elements such as tensor cores, special processing units dedicated to performing fast matrix-multiply-accumulate (MMA) operations to accelerate deep learning applications. In this work we present the idea of using tensor cores for a different purpose, namely the parallel arithmetic reduction problem, and propose a new GPU tensor-core based algorithm as well as analyze its potential performance benefits in comparison to a traditional GPU-based one. The proposed method encodes the reduction of n numbers as a set of m × m MMA tensor-core operations (for Nvidia's Volta architecture, m = 16) and takes advantage of the fact that each MMA operation takes just one GPU cycle. When the cost is analyzed under a simplified GPU computing model, the result is that the new algorithm manages to reduce a problem of n numbers in $T(n) = 5\log_{m^2}(n)$ steps, with a speedup of $S = \frac{4}{5}\log_2(m^2)$.
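The core encoding can be illustrated with a small sketch. The following is a minimal pure-Python stand-in (an assumption for illustration, not the paper's implementation) for how one tile of m × m numbers can be reduced with two MMA operations: multiplying by an all-ones matrix on the left collects column sums, and a second multiplication by an all-ones matrix on the right accumulates those into the total. Here m = 4 is used for readability instead of Volta's m = 16.

```python
m = 4  # tile size; Volta tensor cores operate on m = 16 tiles

def mma(A, B, C):
    """One m x m multiply-accumulate, D = A*B + C, mimicking a single tensor-core op."""
    return [[sum(A[i][k] * B[k][j] for k in range(m)) + C[i][j]
             for j in range(m)] for i in range(m)]

ones  = [[1] * m for _ in range(m)]
zeros = [[0] * m for _ in range(m)]

# Encode m*m = 16 numbers (0..15) as one tile.
data = [[i * m + j for j in range(m)] for i in range(m)]

# Two MMAs reduce the whole tile:
#  1) ones * data  -> every row of `partial` holds the column sums of `data`
#  2) partial * ones -> every entry holds the total sum
partial = mma(ones, data, zeros)
total   = mma(partial, ones, zeros)

print(total[0][0])  # 120 == sum(range(16))
```

In this scheme each pair of MMA operations collapses m² values into a single (replicated) result, which is what yields the $\log_{m^2}(n)$ depth of the reduction tree; the accumulator argument C additionally lets consecutive tiles be summed into the same result without extra steps.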
