Optimizing the Performance of Sparse Matrix-Vector Multiplication

Sparse matrix operations dominate the performance of many scientific and engineering applications. In particular, iterative methods for linear systems, least squares problems, and eigenvalue problems all perform a sparse matrix-vector product in their inner loop. The performance of sparse matrix algorithms is often disappointing on modern machines because the algorithms have poor temporal and spatial locality and are therefore limited by the speed of main memory. Unfortunately, the gap between memory and processor performance is steadily widening: processor performance increases by roughly 60% per year, while memory latency improves by only about 7%. Performance also depends strongly on the nonzero structure of the sparse matrix, the organization of the data and computation, and the exact parameters of the hardware memory system.

This thesis presents a toolkit called Sparsity for the automatic optimization of sparse matrix-vector multiplication. We start with an extensive study of possible memory hierarchy optimizations, in particular, reorganization of the matrix and computation around blocks of the matrix. The study demonstrates that certain kinds of blocking can be effective for both registers and caches, although the nature of the blocking is quite different because of the difference in size between typical register sets and caches. Both types of blocking are highly effective for some matrices but ineffective for others, and the choice of block size is also highly dependent on the matrix and the machine. Thus, to determine automatically when and how the optimizations should be applied, we combine a search over a set of possible optimized versions with newly devised performance models that eliminate or constrain the search to make it practical.

We also consider a common variation of basic sparse matrix-vector multiplication in which a sparse matrix is multiplied by a set of dense vectors. This operation arises, for example, when a linear solver has multiple right-hand sides or when a higher-level algorithm has been blocked. Multiple vectors offer enormous optimization opportunities, effectively turning a matrix-vector operation into a matrix-matrix operation. It is well known that for dense matrices the latter has much higher data reuse than the former and therefore achieves much better performance; the same is true in the sparse case. The Sparsity system is designed as a web service, so scientists and engineers can easily obtain highly optimized sparse matrix routines without needing to understand the optimization techniques or how they are selected.

This thesis also reports on an extensive performance study of over 40 matrices on a variety of machines. The matrices are drawn from various scientific and engineering problems, as well as from linear programming and data mining. The machines include the Alpha 21164, UltraSPARC I, MIPS R10000, and PowerPC 604e. These benchmark results are useful for understanding performance differences across application domains, the effectiveness of the optimizations, and the cost of evaluating our performance models when applying the optimizations. The conclusion is that Sparsity is highly effective, producing routines that are up to 3.1 times faster for a single vector and 6.2 times faster for multiple vectors.
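To make the register-blocking idea concrete, the following sketch contrasts a baseline compressed sparse row (CSR) matrix-vector product with a 2x2 register-blocked (BCSR-style) version. This is a minimal illustrative sketch, not code generated by Sparsity; the type and function names (csr_t, csr_spmv, bcsr22_t, bcsr22_spmv) and the exact storage layout are assumptions made here for exposition.

```c
/* Compressed Sparse Row (CSR): row_ptr[i]..row_ptr[i+1] indexes the
 * nonzeros of row i; col_idx and val hold their columns and values.
 * (Illustrative layout and names; not Sparsity's interface.) */
typedef struct {
    int n;                  /* number of rows (square matrix for simplicity) */
    const int *row_ptr;     /* n+1 row pointers */
    const int *col_idx;     /* column index of each nonzero */
    const double *val;      /* value of each nonzero */
} csr_t;

/* Baseline SpMV: y = A*x.  Each matrix element is used exactly once and
 * accesses to x are irregular, so the loop is bound by memory latency. */
void csr_spmv(const csr_t *A, const double *x, double *y)
{
    for (int i = 0; i < A->n; i++) {
        double sum = 0.0;
        for (int k = A->row_ptr[i]; k < A->row_ptr[i + 1]; k++)
            sum += A->val[k] * x[A->col_idx[k]];
        y[i] = sum;
    }
}

/* 2x2 register-blocked SpMV: the matrix is stored as dense 2x2 blocks,
 * padded with explicit zeros where needed.  One column index per block
 * reduces index overhead, and the two partial sums stay in registers,
 * so each loaded element of x is reused twice. */
typedef struct {
    int nb;                 /* number of block rows (n/2) */
    const int *brow_ptr;    /* nb+1 block-row pointers */
    const int *bcol_idx;    /* block-column index of each 2x2 block */
    const double *bval;     /* blocks stored contiguously, 4 values each */
} bcsr22_t;

void bcsr22_spmv(const bcsr22_t *A, const double *x, double *y)
{
    for (int bi = 0; bi < A->nb; bi++) {
        double y0 = 0.0, y1 = 0.0;           /* accumulators kept in registers */
        for (int k = A->brow_ptr[bi]; k < A->brow_ptr[bi + 1]; k++) {
            const double *b = &A->bval[4 * k];
            double x0 = x[2 * A->bcol_idx[k]];
            double x1 = x[2 * A->bcol_idx[k] + 1];
            y0 += b[0] * x0 + b[1] * x1;
            y1 += b[2] * x0 + b[3] * x1;
        }
        y[2 * bi]     = y0;
        y[2 * bi + 1] = y1;
    }
}
```

Whether the blocked version wins depends on how much zero padding the matrix's nonzero structure forces and on the machine's register set and memory system, which is exactly the selection problem the search and performance models address.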

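The multiple-vector case admits an analogous sketch, reusing the csr_t layout from above. Each matrix element is loaded once and applied to every vector, which is the source of the matrix-matrix-like reuse described in the abstract. Again, the interleaved vector layout, the fixed vector count NV, and the name csr_spmm are assumptions for illustration only.

```c
/* Multiply a CSR matrix by NV dense vectors: X and Y are n-by-NV arrays
 * stored row-major, so X[j*NV + v] is element j of vector v (the vectors
 * are interleaved for locality).  Each A->val[k] is loaded once and
 * reused NV times. */
#define NV 4   /* number of right-hand-side vectors (illustrative) */

void csr_spmm(const csr_t *A, const double *X, double *Y)
{
    for (int i = 0; i < A->n; i++) {
        double sum[NV] = {0.0};
        for (int k = A->row_ptr[i]; k < A->row_ptr[i + 1]; k++) {
            double a = A->val[k];
            const double *xj = &X[A->col_idx[k] * NV];
            for (int v = 0; v < NV; v++)
                sum[v] += a * xj[v];          /* reuse a across all vectors */
        }
        for (int v = 0; v < NV; v++)
            Y[i * NV + v] = sum[v];
    }
}
```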