Tile Low-Rank GEMM Using Batched Operations on GPUs

Dense general matrix-matrix multiplication (GEMM) is a core operation of the Basic Linear Algebra Subprograms (BLAS) library and therefore often resides at the bottom of the traditional software stack for many scientific applications. In fact, chip manufacturers pay special attention to the GEMM kernel implementation, since it is where most high-performance software libraries extract the hardware's performance. With the emergence of big data applications involving large data-sparse, hierarchically low-rank matrices, the off-diagonal tiles of such matrices can be compressed to reduce the algorithmic complexity and the memory footprint. The resulting tile low-rank (TLR) data format is composed of small data structures that retain the most significant information for each tile. However, to operate on low-rank tiles, a new GEMM operation and its corresponding API have to be designed for GPUs, so that the data sparsity structure of the matrix can be exploited while leveraging the underlying TLR compression format. The main idea consists of aggregating all operations into a single kernel launch to compensate for their low arithmetic intensities and to mitigate the data transfer overhead on GPUs. The new TLR-GEMM kernel outperforms the cuBLAS dense batched GEMM by more than an order of magnitude and creates new opportunities for advanced TLR algorithms.
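To make the batching idea concrete, the sketch below multiplies a batch of low-rank tiles on the GPU with three calls to cuBLAS strided-batched GEMM. It assumes each tile is stored as the outer product of two tall-and-skinny factors, A_i = U_{A,i} V_{A,i}^T, with a fixed tile size nb and a fixed rank k. The function name tlr_tile_gemm_batched, the workspace layout, and the fixed rank are illustrative assumptions rather than the paper's actual API; the paper's TLR-GEMM goes further by fusing such chains of small operations into a single kernel launch instead of issuing separate library calls.

// Hedged sketch: multiply batches of tile low-rank (TLR) tiles on the GPU
// with cuBLAS strided-batched GEMM. Names and data layout are illustrative
// assumptions, not the paper's API. Error checking is omitted for brevity.
//
// Each low-rank tile is stored as two tall-and-skinny factors (column-major):
//   A_i = U_{A,i} * V_{A,i}^T,   B_i = U_{B,i} * V_{B,i}^T   (factors nb x k),
// so the tile product decomposes into three small GEMMs per tile:
//   W_i = V_{A,i}^T * U_{B,i}          (k  x k)
//   T_i = W_i       * V_{B,i}^T        (k  x nb)
//   C_i = C_i + U_{A,i} * T_i          (nb x nb, dense accumulation)

#include <cublas_v2.h>

// All pointers are device pointers holding `batch` tiles stored back to back.
void tlr_tile_gemm_batched(cublasHandle_t handle,
                           int nb, int k, int batch,
                           const float *Ua, const float *Va,   // A factors
                           const float *Ub, const float *Vb,   // B factors
                           float *W,  // workspace: k  x k  x batch
                           float *T,  // workspace: k  x nb x batch
                           float *C)  // output:    nb x nb x batch (dense)
{
    const float one = 1.0f, zero = 0.0f;

    // W_i = V_{A,i}^T * U_{B,i}
    cublasSgemmStridedBatched(handle, CUBLAS_OP_T, CUBLAS_OP_N,
                              k, k, nb, &one,
                              Va, nb, (long long)nb * k,
                              Ub, nb, (long long)nb * k,
                              &zero,
                              W,  k,  (long long)k * k,
                              batch);

    // T_i = W_i * V_{B,i}^T
    cublasSgemmStridedBatched(handle, CUBLAS_OP_N, CUBLAS_OP_T,
                              k, nb, k, &one,
                              W,  k,  (long long)k * k,
                              Vb, nb, (long long)nb * k,
                              &zero,
                              T,  k,  (long long)k * nb,
                              batch);

    // C_i += U_{A,i} * T_i
    cublasSgemmStridedBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                              nb, nb, k, &one,
                              Ua, nb, (long long)nb * k,
                              T,  k,  (long long)k * nb,
                              &one,
                              C,  nb, (long long)nb * nb,
                              batch);
}

The savings come from the operand shapes: each batched call works on nb-by-k or k-by-k blocks instead of full nb-by-nb tiles, so the flops and data moved per tile drop from O(nb^2) to O(nb k) when k is much smaller than nb, which is precisely the regime in which a TLR-GEMM can outperform dense batched GEMM.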
