High-Performance Tensor Contractions for GPUs

We present a computational framework for high-performance tensor contractions on GPUs. High performance is difficult to obtain with existing libraries, especially for many independent contractions where each contraction is very small, e.g., sub-vector/warp in size. However, by using our framework to batch contractions and by exploiting application specifics, we demonstrate results close to peak performance. In particular, to accelerate large-scale tensor-formulated high-order finite element method (FEM) simulations, which are the main focus and motivation for this work, we represent contractions as tensor index reordering plus matrix-matrix multiplications (GEMMs). This representation is key to achieving a many-fold algorithmic acceleration (compared with not using it), because data loaded into fast memory can be reused. In addition to using this context knowledge, we design tensor data structures, tensor algebra interfaces, and new tensor contraction algorithms and implementations to achieve 90+% of a theoretically derived peak on GPUs. For example, on a K40c GPU, for contractions resulting in GEMMs on square matrices of size 8, we are 2.8× faster than CUBLAS and 8.5× faster than MKL on 16 cores of Intel Xeon E5-2670 (Sandy Bridge) 2.60GHz CPUs. Finally, we apply autotuning and code generation techniques to simplify tuning and provide an architecture-aware, user-friendly interface.
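
The idea of expressing a contraction as index reordering plus a GEMM can be illustrated with a small sketch. The contraction chosen here, C[e][i][j] = sum_k A[i][k] * T[e][k][j] with a shared matrix A applied to a batch of small matrices T[e], is an assumed example and is not taken from the paper; the function names `gemm`, `contract_direct`, and `contract_as_gemm` are hypothetical. The sketch is plain CPU Python meant only to show the index manipulation, not a GPU implementation.

```python
def gemm(A, B):
    """Plain matrix product of row-major nested lists: (m x k) @ (k x n)."""
    m, k, n = len(A), len(B), len(B[0])
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


def contract_direct(A, T):
    """Reference: C[e][i][j] = sum_k A[i][k] * T[e][k][j], one small product per e."""
    return [[[sum(A[i][k] * Te[k][j] for k in range(len(Te)))
              for j in range(len(Te[0]))]
             for i in range(len(A))]
            for Te in T]


def contract_as_gemm(A, T):
    """Same contraction as a single GEMM after reordering T from (e, k, j) to (k, (e, j))."""
    batch, k, n = len(T), len(T[0]), len(T[0][0])
    # Index reordering: flatten the pair (e, j) into one column index e * n + j.
    B = [[T[e][p][j] for e in range(batch) for j in range(n)] for p in range(k)]
    # One larger product: (m x k) @ (k x batch*n); A is read once for the whole batch.
    C_flat = gemm(A, B)
    # Reorder the result back to (e, i, j).
    return [[[C_flat[i][e * n + j] for j in range(n)] for i in range(len(A))]
            for e in range(batch)]


# Assumed toy data: a 2 x 2 matrix A and a batch of two 2 x 2 matrices.
A = [[1, 2], [3, 4]]
T = [[[1, 0], [0, 1]], [[2, 1], [0, 3]]]
C = contract_as_gemm(A, T)
```

Both routines compute the same values; the GEMM form simply groups the batch so that the shared operand A is loaded once and reused across all columns, which is the kind of data reuse the abstract refers to.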
