STOMP: Statistical Techniques for Optimizing and Modeling Performance of Blocked Sparse Matrix Vector Multiplication

Sparse matrix-vector multiplication (SpMV) is the core compute routine in many scientific and commercial codebases. Its memory accesses are highly irregular (low temporal locality) and indirect (low spatial locality), its arithmetic intensity is low, and its behavior depends on the non-zero pattern and density of the matrix; as a result, SpMV achieves a mere 10% of peak system performance. Because sparse matrices vary widely in non-zero pattern and density, SpMV performance is also hard to predict. Blocking a sparse matrix increases arithmetic intensity and spatial locality during SpMV, thereby improving performance. However, selecting the wrong block size can degrade performance by as much as 70%. In this study, we describe STOMP, an approach that uses statistical techniques to predict the run time of SpMV in PETSc for new matrices with a mean accuracy of 93.52%. We use these prediction models to guide block size selection, achieving up to 100% of optimal performance, comparable to that attained through exhaustive block size search. Our block size selection yields an average speedup of 55.56% over the default SpMV options. On the same set of matrices used in the SPARSITY SpMV framework, STOMP yields a 54.46% speedup, while SPARSITY yields a 31.62% speedup over the same default.
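To illustrate why block size matters, the following minimal sketch computes a fill ratio: the number of entries stored after r x c blocking (including explicit zeros used to pad partially filled blocks) divided by the number of true non-zeros. This is only an illustration of one commonly discussed cost of blocking, not the STOMP prediction model; the function name `fill_ratio` and the small example matrix are hypothetical.

```python
def fill_ratio(coords, r, c):
    """Entries stored under r x c blocking divided by true non-zeros.

    coords: iterable of (row, col) positions of non-zero entries.
    A ratio of 1.0 means blocking adds no padding zeros.
    """
    coords = set(coords)
    # Every non-zero falls into exactly one r x c block; each touched
    # block is stored densely, so it costs r * c entries.
    blocks = {(i // r, j // c) for i, j in coords}
    return len(blocks) * r * c / len(coords)


# Hypothetical 4 x 4 matrix with two dense 2 x 2 blocks on the diagonal.
coords = [(0, 0), (0, 1), (1, 0), (1, 1),
          (2, 2), (2, 3), (3, 2), (3, 3)]

for r, c in [(1, 1), (2, 2), (3, 3), (4, 4)]:
    print(f"{r}x{c}: fill ratio = {fill_ratio(coords, r, c):.2f}")
```

In this toy case the 2 x 2 blocking matches the matrix structure and stores no padding, while a 3 x 3 blocking stores several times more entries than there are non-zeros, which suggests how a poorly chosen block size could erase the benefit of blocking.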
