A framework for dense triangular matrix kernels on various manycore architectures

We present a new high-performance framework for dense triangular Basic Linear Algebra Subroutines (BLAS) kernels, i.e., triangular matrix-matrix multiplication (TRMM) and triangular solve (TRSM), on various manycore architectures. This work extends previous work by the same authors on a single GPU, presented at the Euro-Par'16 conference, in which we demonstrated the effectiveness of recursive formulations in enhancing the performance of these kernels. In this paper, the performance of triangular BLAS kernels on a single GPU is further enhanced by implementing customized in-place CUDA kernels for TRMM and TRSM, which are called at the bottom of the recursion. In addition, a multi-GPU implementation of TRMM and TRSM is proposed, and we show almost linear performance scaling as the number of GPUs increases. Finally, the algorithmic recursive formulation of these triangular BLAS kernels is oblivious to the targeted hardware architecture. We therefore port these recursive kernels to homogeneous x86 hardware architectures by relying on the vendor-optimized BLAS implementations. Results reported on various hardware architectures highlight a significant performance improvement over state-of-the-art implementations. These new kernels are freely available in the KAUST BLAS (KBLAS) open-source library at https://github.com/ecrc/kblas.
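
To illustrate the recursive formulation underlying the framework, the sketch below shows one common variant of recursive TRSM (left side, lower triangular, no transpose, column-major), which halves the triangular factor, recurses on the diagonal blocks, and casts the off-diagonal update into a large GEMM. The function name rec_trsm, the crossover size REC_TRSM_STOP, and the use of CBLAS as the fallback backend are illustrative assumptions for this sketch; they are not the KBLAS API or its actual implementation.

```c
/*
 * Minimal sketch of a recursive TRSM (Left / Lower / NoTranspose / NonUnit,
 * column-major): solves A * X = alpha * B and overwrites B with X.
 * rec_trsm and REC_TRSM_STOP are hypothetical names for illustration only;
 * the base case simply falls back to the vendor (CBLAS) dtrsm.
 */
#include <cblas.h>

#define REC_TRSM_STOP 128  /* assumed crossover size to the vendor kernel */

static void rec_trsm(int m, int n, double alpha,
                     const double *A, int lda,
                     double *B, int ldb)
{
    if (m <= REC_TRSM_STOP) {
        /* Base case: plain vendor TRSM on the small diagonal block. */
        cblas_dtrsm(CblasColMajor, CblasLeft, CblasLower,
                    CblasNoTrans, CblasNonUnit,
                    m, n, alpha, A, lda, B, ldb);
        return;
    }

    int m1 = m / 2;       /* split A = [A11 0; A21 A22], B = [B1; B2] */
    int m2 = m - m1;

    /* X1 = alpha * inv(A11) * B1 (recursive call, result stored in B1). */
    rec_trsm(m1, n, alpha, A, lda, B, ldb);

    /* B2 := alpha * B2 - A21 * X1 (off-diagonal update as one large GEMM). */
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                m2, n, m1,
                -1.0,  A + m1, lda,   /* A21 */
                       B,      ldb,   /* X1  */
                alpha, B + m1, ldb);  /* B2  */

    /* X2 = inv(A22) * B2 (recursive call; alpha has already been applied). */
    rec_trsm(m2, n, 1.0, A + (size_t)m1 * lda + m1, lda, B + m1, ldb);
}
```

TRMM follows the same recursive pattern with the GEMM update reversed. On a single GPU, the abstract's customized in-place CUDA kernels would take the place of the base-case call, while on x86 architectures the recursion can simply bottom out in the vendor-optimized BLAS, as described above.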
