Performance Optimizations and Bounds for Sparse Matrix Kernels

Building high-performance implementations of sparse matrix-vector multiply (SpM×V), an important and ubiquitous computational kernel, is fundamentally limited by several factors: the widening performance gap between processors and memory, the storage and instruction overhead of manipulating sparse data structures, and the irregular memory access patterns that sparse storage induces. Moreover, the difficulty of modeling execution on modern microprocessors makes selecting the best data structure and SpM×V implementation for a given sparse matrix a hard problem. In this paper, we consider a range of performance bounds and models for SpM×V, both practical and hypothetical, that quantify these limits. The models vary both in cost and in what information they assume, from the purely static to those that assume perfect knowledge of run-time information available via processor hardware counters. We evaluate these models and bounds on a variety of hardware platforms and matrices, and show that selecting the best implementation and data structure for a particular matrix requires a combination of modeling and run-time searching. Furthermore, we use our performance bounds to show that our previously developed optimization technique, register blocking, which improves register-level reuse by exploiting naturally occurring dense subblocks, is unlikely to be improved significantly by additional low-level instruction scheduling. Instead, we examine our recent efforts to overcome these fundamental limits through the use of other kernels: multiplication with symmetric matrices, multiplication by multiple vectors, and higher-level kernels such as multiplication of a vector by AᵀA.
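To make the kernel concrete, the following is a minimal sketch of SpM×V over the standard compressed sparse row (CSR) format. It is illustrative only, not the paper's tuned implementation: the optimized kernels discussed above use register-blocked (BCSR) variants, whereas this version shows the baseline indirect access to the source vector x via the column-index array, which is the irregular memory access pattern the abstract refers to.

```python
def spmv_csr(row_ptr, col_idx, vals, x):
    """Compute y = A*x for A in CSR format.

    row_ptr has n+1 entries; col_idx and vals hold the nonzeros
    of A row by row, so row i's nonzeros occupy the index range
    [row_ptr[i], row_ptr[i+1]).
    """
    n = len(row_ptr) - 1
    y = [0.0] * n
    for i in range(n):
        acc = 0.0
        # The gather x[col_idx[k]] is the irregular access that
        # limits memory-system performance for sparse storage.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += vals[k] * x[col_idx[k]]
        y[i] = acc
    return y


# Example: A = [[2, 0, 1],
#               [0, 3, 0],
#               [4, 0, 5]], x = [1, 1, 1]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
vals = [2.0, 1.0, 3.0, 4.0, 5.0]
x = [1.0, 1.0, 1.0]
print(spmv_csr(row_ptr, col_idx, vals, x))  # -> [3.0, 3.0, 9.0]
```

Register blocking replaces the scalar inner loop with small fixed-size r×c dense block multiplies, amortizing index storage and keeping reused operands in registers, at the cost of storing explicit zeros where blocks are not fully dense.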
