Auto-tuning of Sparse Matrix-Vector Multiplication on Graphics Processors

We present a heuristics-based auto-tuner for sparse matrix-vector multiplication (SpMV) on GPUs. For a given sparse matrix, our framework delivers a high-performance SpMV kernel that combines the most effective storage format with tuned parameters of the corresponding code for the underlying GPU architecture. We use 250 matrices from 23 application areas to develop heuristics that prune the auto-tuning search space. For performance evaluation, we use 59 matrices from 12 application areas and different NVIDIA GPUs. The kernels delivered by our framework achieve a maximum speedup of 7x over NVIDIA library kernels. For most matrices, their performance is within 1% of that of the kernels found by exhaustive search. Compared to exhaustive-search auto-tuning, our framework can be more than one order of magnitude faster.
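
To make the idea of heuristic pruning concrete, the following is a minimal sketch, not the framework's actual heuristics. It assumes a hypothetical search space of (storage format, threads-per-row) pairs, a matrix described by its CSR row-pointer array, and two made-up pruning rules based on row-length statistics. The names `FULL_SPACE`, `prune`, and `autotune`, the thresholds, and the `benchmark` callback are all illustrative assumptions; on a real system `benchmark` would time a GPU kernel for each surviving candidate.

```python
from statistics import mean

# Hypothetical search space: every (format, threads_per_row) combination.
FORMATS = ("CSR", "ELL", "COO")
THREADS_PER_ROW = (1, 2, 4, 8, 16, 32)
FULL_SPACE = [(f, t) for f in FORMATS for t in THREADS_PER_ROW]


def row_lengths(row_ptr):
    """Number of nonzeros in each row, from a CSR row-pointer array."""
    return [row_ptr[i + 1] - row_ptr[i] for i in range(len(row_ptr) - 1)]


def prune(row_ptr):
    """Return the candidates that survive two illustrative heuristics."""
    lens = row_lengths(row_ptr)
    avg = mean(lens)
    longest = max(lens)

    # Assumed rule 1: ELL pads every row to the longest row, so keep it
    # only when the longest row is at most twice the average row length.
    formats = ["CSR"]
    if avg > 0 and longest <= 2 * avg:
        formats.append("ELL")
    else:
        formats.append("COO")

    # Assumed rule 2: do not assign more threads to a row than the
    # smallest power of two that covers the average row length.
    limit = 1
    while limit < avg and limit < 32:
        limit *= 2

    return [(f, t) for f in formats for t in THREADS_PER_ROW if t <= limit]


def autotune(candidates, benchmark):
    """Pick the candidate with the lowest measured cost."""
    return min(candidates, key=benchmark)


# A regular matrix (4 nonzeros per row) and a skewed one (one long row).
regular = prune([0, 4, 8, 12, 16])
skewed = prune([0, 1, 2, 3, 20])
```

With these assumed rules, the regular matrix keeps 6 of the 18 candidates and the skewed one keeps 8, so only those would need to be benchmarked. An exhaustive search would instead call `autotune(FULL_SPACE, benchmark)`.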
