KBLAS: An Optimized Library for Dense Matrix-Vector Multiplication on GPU Accelerators

KBLAS is an open-source, high-performance library that provides optimized kernels for a subset of Level 2 BLAS functionalities on CUDA-enabled GPUs. Since the performance of dense matrix-vector multiplication is hindered by the overhead of memory accesses, a double-buffering optimization technique is employed to overlap data motion with computation. After identifying a proper set of tuning parameters, KBLAS runs efficiently on various GPU architectures without code rewriting and while remaining compliant with the standard BLAS API. Another optimization technique ensures coalesced memory accesses when dealing with submatrices, which is especially important for high-level dense linear algebra algorithms. All KBLAS kernels have been extended to multi-GPU environments, which required the introduction of new APIs. For general matrices, KBLAS is very competitive with existing state-of-the-art kernels and delivers smoother performance across a wide range of matrix dimensions. For symmetric and Hermitian matrices, KBLAS outperforms existing state-of-the-art implementations on all matrix sizes and achieves asymptotic speedups of up to 50% and 60% over the best competitor on single-GPU and multi-GPU systems, respectively. The performance results also validate our performance model. A subset of KBLAS high-performance kernels has been integrated into NVIDIA's standard BLAS implementation (cuBLAS), starting from version 6.0, for wider dissemination.
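
To make the double-buffering idea concrete, the following CUDA sketch shows a GEMV-style kernel (y = alpha*A*x + beta*y, column-major A) that prefetches the next panel of the vector x into one shared-memory buffer while the current panel is consumed from the other, so that global-memory loads overlap with the arithmetic. This is a minimal illustration of the overlapping technique, not the KBLAS implementation; the kernel name, the BLOCK_ROWS and PANEL_COLS sizes, and the simplifying assumptions (n divisible by PANEL_COLS, BLOCK_ROWS >= PANEL_COLS, one row of A per thread) are ours.

#include <cuda_runtime.h>

// Illustrative block/panel sizes (assumptions, not the library's tuned values).
#define BLOCK_ROWS 128   // threads per block; one row of A per thread
#define PANEL_COLS  32   // columns of A (and entries of x) streamed per step

// y = alpha * A * x + beta * y for a column-major m-by-n matrix A.
// Assumes n is a multiple of PANEL_COLS and BLOCK_ROWS >= PANEL_COLS.
__global__ void dgemv_double_buffered(int m, int n, double alpha,
                                      const double* __restrict__ A, int lda,
                                      const double* __restrict__ x,
                                      double beta, double* __restrict__ y)
{
    const int  row    = blockIdx.x * BLOCK_ROWS + threadIdx.x;
    const bool active = (row < m);

    // Two shared-memory buffers: while one panel of x is being consumed,
    // the next panel is fetched into the other buffer, so the loads for
    // step s+1 can overlap with the multiply-adds of step s.
    __shared__ double xs[2][PANEL_COLS];

    double    acc    = 0.0;
    const int nsteps = n / PANEL_COLS;

    // Preload the first panel of x into buffer 0.
    if (threadIdx.x < PANEL_COLS)
        xs[0][threadIdx.x] = x[threadIdx.x];
    __syncthreads();

    for (int s = 0; s < nsteps; ++s) {
        const int cur = s & 1;
        const int nxt = cur ^ 1;

        // Prefetch the next panel of x into the idle buffer.
        if (s + 1 < nsteps && threadIdx.x < PANEL_COLS)
            xs[nxt][threadIdx.x] = x[(s + 1) * PANEL_COLS + threadIdx.x];

        // Compute on the current panel while the prefetch is in flight.
        if (active) {
            const double* Acol = A + (size_t)s * PANEL_COLS * lda + row;
            #pragma unroll
            for (int j = 0; j < PANEL_COLS; ++j)
                acc += Acol[(size_t)j * lda] * xs[cur][j];
        }

        __syncthreads();   // prefetched panel is now visible to all threads
    }

    if (active)
        y[row] = alpha * acc + beta * y[row];
}

A launch of the form dgemv_double_buffered<<<(m + BLOCK_ROWS - 1) / BLOCK_ROWS, BLOCK_ROWS>>>(m, n, alpha, dA, lda, dx, beta, dy) would drive the kernel. The design point illustrated here is that the prefetch and the compute within one iteration touch independent buffers, so the loads for the next panel can proceed concurrently with the fused multiply-adds on the current one.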
