Novel HPC techniques to batch execution of many variable size BLAS computations on GPUs

This paper presents a software framework for solving large numbers of relatively small matrix problems on GPUs. Our approach combines novel and existing HPC techniques, methodically applying performance analysis, kernel design, low-level optimizations, and autotuning to exceed the performance of proprietary vendor libraries. As a case study, we discuss the fundamental matrix operations defined by the Basic Linear Algebra Subprograms (BLAS) standard. This case study is important for a wide range of applications, including astrophysics, tensor contractions, and sparse direct solvers. We provide a generic design that handles problems of different sizes, as well as the irregularity arising from size variations. The developed solution adopts a batched computation scheme, in which the same operation is applied concurrently to all matrices within a single computational kernel. The paper discusses the data layout, kernel design, and optimization techniques. We also propose a design scheme centered around the matrix-matrix multiplication (GEMM) kernel, so that any improvement to this particular kernel propagates automatically to the other routines. Our performance results show significant speedups on a Pascal generation GPU (Tesla P100) against state-of-the-art solutions based on cuBLAS, as well as against two 10-core Haswell CPUs running the MKL library. This work is part of the MAGMA library.
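To make the batched scheme concrete, below is a minimal CUDA sketch of a variable-size batched GEMM in this style: each thread block handles one problem, reading its own dimensions and data pointers from device arrays, so the entire batch executes in a single kernel launch. The kernel name and parameter names (vbatched_dgemm_naive, dA_array, ldda, and so on) are illustrative assumptions, not the MAGMA API; the actual MAGMA kernels layer blocking, shared-memory staging, and autotuned register tiling on top of this basic scheme.

    #include <cuda_runtime.h>

    // One thread block per problem: each block reads its own m, n, k and
    // pointers, then computes C = alpha*A*B + beta*C for that problem.
    // All matrices are column-major with per-problem leading dimensions.
    __global__ void vbatched_dgemm_naive(
        const int* m, const int* n, const int* k,
        double alpha,
        const double* const* dA_array, const int* ldda,
        const double* const* dB_array, const int* lddb,
        double beta,
        double** dC_array, const int* lddc)
    {
        const int bid = blockIdx.x;  // problem index within the batch
        const int M = m[bid], N = n[bid], K = k[bid];
        const double* A = dA_array[bid];
        const double* B = dB_array[bid];
        double*       C = dC_array[bid];
        const int lda = ldda[bid], ldb = lddb[bid], ldc = lddc[bid];

        // Threads cooperatively cover the M x N output of their problem;
        // loop bounds depend only on this problem's sizes, which is how the
        // irregularity of a variable-size batch is absorbed in one kernel.
        for (int j = threadIdx.y; j < N; j += blockDim.y) {
            for (int i = threadIdx.x; i < M; i += blockDim.x) {
                double acc = 0.0;
                for (int p = 0; p < K; ++p)
                    acc += A[i + p * lda] * B[p + j * ldb];
                C[i + j * ldc] = alpha * acc + beta * C[i + j * ldc];
            }
        }
    }

    // Launch with one block per problem, e.g.:
    //   dim3 threads(16, 16);
    //   vbatched_dgemm_naive<<<batchCount, threads>>>(d_m, d_n, d_k, alpha,
    //       dA_array, d_ldda, dB_array, d_lddb, beta, dC_array, d_lddc);

This sketch also illustrates the GEMM-centric design argument: since the batched kernel is parameterized over per-problem shapes and pointers, tuning its inner loop (tiling, prefetching, register blocking) benefits every routine built on top of it.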
