Performance Optimizations and Bounds for Sparse Matrix-Vector Multiply

We consider performance tuning, by code and data structure reorganization, of sparse matrix-vector multiply (SpM×V), one of the most important computational kernels in scientific applications. This paper addresses the fundamental questions of what limits exist on such performance tuning and how closely tuned code approaches these limits. Specifically, we develop upper and lower bounds on the performance (Mflop/s) of SpM×V when tuned using our previously proposed register blocking optimization. These bounds are based on the non-zero pattern of the matrix and the cost of basic memory operations, such as cache hits and misses. We evaluate our tuned implementations with respect to these bounds using hardware counter data on four different platforms and a test set of 44 sparse matrices. We find that we can often get within 20% of the upper bound, particularly on a class of matrices from finite element modeling (FEM) problems; on non-FEM matrices, performance improvements of 2× are still possible. Lastly, we present a new heuristic that selects optimal or near-optimal register block sizes (the key tuning parameters) more accurately than our previous heuristic. Using the new heuristic, we show improvements in SpM×V performance (Mflop/s) of as much as 2.5× over an untuned implementation. Collectively, our results suggest that future performance improvements, beyond those we have already demonstrated for SpM×V, will come from two sources: (1) consideration of higher-level matrix structures (e.g., exploiting symmetry, matrix reordering, multiple register block sizes), and (2) optimizing kernels with more opportunity for data reuse (e.g., sparse matrix-multiple vector multiply, multiplication of AᵀA by a vector).
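The bound derivation appears in the body of the paper; as a rough sketch of its flavor (the notation here is ours, not necessarily the paper's), a latency-based lower bound on execution time charges each load to the cache level that services it, with h_i hits at cache level i of latency alpha_i, and m_L misses at the last level, each served by memory at latency alpha_mem:

    T_lower = \sum_{i=1}^{L} h_i \alpha_i + m_L \alpha_{mem},
    \qquad
    P_upper = \frac{2 \cdot nnz}{T_lower}

where 2·nnz counts one multiply and one add per stored non-zero, so an implementation maximizes achievable Mflop/s by minimizing misses.

Register blocking itself stores the matrix in block compressed sparse row (BCSR) format and unrolls the multiply over small r×c dense blocks so that destination-vector entries stay in registers. Below is a minimal sketch in C for a fixed 2×2 block size; the array layout and names are illustrative assumptions, not the actual Sparsity implementation.

    /* Sketch: y += A*x with A stored in 2x2 BCSR format.
     * brow_ptr[i]..brow_ptr[i+1]-1 index the blocks of block row i;
     * bcol_idx[k] is the block-column index of block k;
     * bval[4*k..4*k+3] holds block k in row-major order.
     * All names are illustrative, not from the paper. */
    void bcsr_spmv_2x2(int n_brows, const int *brow_ptr,
                       const int *bcol_idx, const double *bval,
                       const double *x, double *y)
    {
        for (int i = 0; i < n_brows; i++) {
            /* Keep both destination entries in registers across the row. */
            double y0 = y[2*i], y1 = y[2*i + 1];
            for (int k = brow_ptr[i]; k < brow_ptr[i+1]; k++) {
                const double *b = &bval[4*k];
                int j = 2 * bcol_idx[k];
                double x0 = x[j], x1 = x[j + 1];
                y0 += b[0]*x0 + b[1]*x1;   /* top row of the block    */
                y1 += b[2]*x0 + b[3]*x1;   /* bottom row of the block */
            }
            y[2*i]     = y0;
            y[2*i + 1] = y1;
        }
    }

Relative to unblocked compressed sparse row, this stores one column index per block rather than per non-zero and reuses x0 and x1 across both rows of each block, which is the register-level reuse the abstract refers to; the cost is explicit zero fill wherever a 2×2 block is not fully dense, which is why block-size selection (the heuristic above) matters.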
