Optimizing symmetric dense matrix-vector multiplication on GPUs

GPUs are excellent accelerators for data parallel applications with regular data access patterns. It is challenging, however, to optimize computations with irregular data access patterns on GPUs. One such computation is the Symmetric Matrix Vector product (SYMV) for dense linear algebra. Optimizing the SYMV kernel is important because it forms the basis of fundamental algorithms such as linear solvers and eigenvalue solvers on symmetric matrices. In this work, we present a new algorithm for optimizing the SYMV kernel on GPUs. Our optimized SYMV in single precision brings up to a 7x speed up compared to the (latest) CUBLAS 4.0 NVIDIA library on the GTX 280 GPU. Our SYMV kernel tuned for Fermi C2050 is 4.5x faster than CUBLAS 4.0 in single precision and 2x faster than CUBLAS 4.0 in double precision. Moreover, the techniques used and described in the paper are general enough to be of interest for developing high-performance GPU kernels beyond the particular case of SYMV.

[1]  Jack J. Dongarra,et al.  Accelerating the reduction to upper Hessenberg, tridiagonal, and bidiagonal forms through hybrid GPU-based computing , 2010, Parallel Comput..

[2]  R. C. Whaley,et al.  Automated empirical optimization of high performance floating point kernels , 2004 .

[3]  Jack Dongarra,et al.  Accelerating the reduction to upper Hessenberg form through hybrid GPU-based computing , 2009 .

[4]  Jack J. Dongarra,et al.  An Improved Magma Gemm For Fermi Graphics Processing Units , 2010, Int. J. High Perform. Comput. Appl..

[5]  Jack Dongarra,et al.  Scientific Computing with Multicore and Accelerators , 2010, Chapman and Hall / CRC computational science series.

[6]  Rajib Kumar Nath,et al.  Accelerating Dense Linear Algebra for GPUs, Multicores and Hybrid Architectures: an Autotuned and Algorithmic Approach , 2010 .

[7]  Emmanuel Agullo,et al.  A Fully Empirical Autotuned Dense QR Factorization for Multicore Architectures , 2011, Euro-Par.

[8]  Steven G. Johnson,et al.  FFTW: an adaptive software architecture for the FFT , 1998, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181).

[9]  James Demmel,et al.  Benchmarking GPUs to tune dense linear algebra , 2008, HiPC 2008.

[10]  James Demmel,et al.  Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology , 1997, ICS '97.

[11]  Victor Eijkhout,et al.  Self-Adapting Linear Algebra Algorithms and Software , 2005, Proceedings of the IEEE.

[12]  Ed Anderson,et al.  LAPACK Users' Guide , 1995 .

[13]  Jack J. Dongarra,et al.  BLAS for GPUs , 2010, Scientific Computing with Multicore and Accelerators.

[15]  Jack J. Dongarra,et al.  Accelerating GPU Kernels for Dense Linear Algebra , 2010, VECPAR.

[16]  Jack J. Dongarra,et al.  Dense linear algebra solvers for multicore with GPU accelerators , 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW).

[17]  Jack J. Dongarra,et al.  A Note on Auto-tuning GEMM for GPUs , 2009, ICCS.

[18]  Jack J. Dongarra,et al.  Automated empirical optimizations of software and the ATLAS project , 2001, Parallel Comput..

[19]  Yuefan Deng,et al.  New trends in high performance computing , 2001, Parallel Computing.