Autotuning Sparse Matrix-Vector Multiplication for Multicore

Sparse matrix-vector multiplication (SpMV) is an important kernel in scientific and engineering computing. Straightforward parallel implementations of SpMV often perform poorly, and as the architectural features of multicore processors grow more varied, it is becoming harder to determine which sparse matrix data structure and corresponding SpMV implementation give the best performance. In this paper we present pOSKI, an autotuning system for SpMV that automatically searches over a large set of possible data structures and implementations to optimize SpMV performance on multicore platforms. pOSKI explores a design space that depends on two things: the nonzero pattern of the sparse matrix, which is typically not known until run-time, and the architecture, which is explored off-line as much as possible in order to reduce tuning time. We demonstrate significant performance improvements over previous serial and parallel implementations, and compare the measured performance to upper bounds based on architectural models.

General Terms: Design, Experimentation, Performance
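As a rough illustration of the idea in the abstract, the sketch below computes SpMV for a matrix in compressed sparse row (CSR) form and then picks the faster of two candidate implementations by timing them on the actual matrix. It is a minimal sketch only, not pOSKI's interface or search strategy: the function names `spmv_csr`, `spmv_csr_genexpr`, and `pick_fastest`, the choice of CSR, and the two candidates are all assumptions made for this example.

```python
import time


def spmv_csr(row_ptr, col_idx, vals, x):
    """Compute y = A @ x for A stored in compressed sparse row (CSR) form."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        # Nonzeros of row i occupy positions row_ptr[i] .. row_ptr[i+1]-1.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += vals[k] * x[col_idx[k]]
        y[i] = acc
    return y


def spmv_csr_genexpr(row_ptr, col_idx, vals, x):
    """Same product, with the inner loop written as a generator expression."""
    return [
        sum(vals[k] * x[col_idx[k]] for k in range(row_ptr[i], row_ptr[i + 1]))
        for i in range(len(row_ptr) - 1)
    ]


def pick_fastest(candidates, args, repeats=5):
    """Hypothetical run-time search: time each candidate, keep the quickest."""
    best, best_time = None, float("inf")
    for func in candidates:
        start = time.perf_counter()
        for _ in range(repeats):
            func(*args)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best, best_time = func, elapsed
    return best


# Example matrix [[4, 0, 1], [0, 2, 0], [3, 0, 5]] in CSR form.
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
vals = [4.0, 1.0, 2.0, 3.0, 5.0]
x = [1.0, 2.0, 3.0]

tuned = pick_fastest([spmv_csr, spmv_csr_genexpr], (row_ptr, col_idx, vals, x))
y = tuned(row_ptr, col_idx, vals, x)  # [7.0, 4.0, 18.0]
```

Which candidate wins depends on the machine and the matrix, which is why the selection is made empirically rather than fixed in advance. A real autotuner would search a far larger space (for example, different storage formats and thread counts) and would move as much of the search as possible off-line.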
