Performance Analysis and Optimization of Sparse Matrix-Vector Multiplication on Modern Multi- and Many-Core Processors

This paper presents a low-overhead optimizer for the ubiquitous sparse matrix-vector multiplication (SpMV) kernel. Architectural diversity among processors, combined with structural diversity among sparse matrices, leads to diversity in performance bottlenecks. This justifies an SpMV optimizer that adapts to both the matrix and the architecture through runtime specialization. To this end, we present an approach that first identifies the performance bottlenecks of SpMV for a given sparse matrix on the target platform, either through profiling or by inspecting matrix properties, and then selects suitable optimizations to tackle those bottlenecks. Our optimization pool is based on the widely used Compressed Sparse Row (CSR) storage format and has low preprocessing overhead, making the overall approach practical even when fast decision making and optimization setup are required. We evaluate our optimizer on three x86-based computing platforms and demonstrate that it is able to distinguish and appropriately optimize SpMV for the majority of matrices in a representative test suite, leading to significant speedups over the CSR and Inspector-Executor CSR SpMV kernels available in the latest release of the Intel MKL library.
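For readers unfamiliar with the CSR format named above, the following is a minimal reference sketch of a CSR SpMV kernel, written in plain Python for clarity rather than speed. It is not the optimizer described in this paper. The array names `row_ptr`, `col_idx`, and `values` follow common CSR convention, and the helper `row_length_stats` is a hypothetical illustration of the kind of matrix property that could be inspected cheaply; the properties the paper actually uses are not stated in the abstract.

```python
def csr_spmv(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A stored in CSR form.

    row_ptr[i]:row_ptr[i + 1] delimits the nonzeros of row i inside
    col_idx (column indices) and values (nonzero values).
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # Indirect access to x through col_idx is what makes
            # SpMV sensitive to the sparsity structure of the matrix.
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y


def row_length_stats(row_ptr):
    """Hypothetical inspection helper (an assumption, not from the paper):
    return the mean and maximum number of nonzeros per row."""
    lengths = [row_ptr[i + 1] - row_ptr[i] for i in range(len(row_ptr) - 1)]
    return sum(lengths) / len(lengths), max(lengths)


# Example 3x3 matrix:
# [[4, 0, 1],
#  [0, 2, 0],
#  [3, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [4.0, 1.0, 2.0, 3.0, 5.0]
x = [1.0, 2.0, 3.0]

y = csr_spmv(row_ptr, col_idx, values, x)
mean_nnz, max_nnz = row_length_stats(row_ptr)
```

A large gap between the mean and maximum row length is one plausible sign of load imbalance across threads, which is why a statistic like this can be computed from `row_ptr` alone without touching the nonzero values.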
