Memory Hierarchy Optimizations and Performance Bounds for Sparse A^T Ax

This paper presents uniprocessor performance optimizations, automatic tuning techniques, and an experimental analysis of the sparse matrix operation y = A^T A x, where A is a sparse matrix and x, y are dense vectors. We describe an implementation of this computational kernel that brings A through the memory hierarchy only once, and that can be combined naturally with the register blocking optimization previously proposed in the Sparsity tuning system for sparse matrix-vector multiply. We evaluate these optimizations on a benchmark set of 44 matrices and 4 platforms, showing speedups of up to 4.2×. We also develop platform-specific upper bounds on the performance of these implementations. We analyze how closely we can approach these bounds, and show when low-level tuning techniques (e.g., better instruction scheduling) are likely to yield a significant pay-off. Finally, we propose a hybrid off-line/run-time heuristic that in practice automatically selects near-optimal values of the key tuning parameters, the register block sizes.
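As an illustration of how A can be read only once, note that y = A^T A x can be written as a sum over the rows a_i^T of A: y = sum_i a_i (a_i^T x). Each row is used for a dot product with x and then immediately reused for a scaled update of y while it is still in cache. The following is a minimal sketch of that idea, assuming A is stored in compressed sparse row (CSR) format; the function name `ata_x` and its argument names are hypothetical and chosen for this example only, and the sketch omits register blocking and any platform-specific tuning.

```python
def ata_x(n_cols, row_ptr, col_idx, vals, x):
    """Compute y = A^T (A x) in a single pass over a CSR matrix A.

    Assumed (hypothetical) layout: row i of A occupies
    vals[row_ptr[i]:row_ptr[i + 1]], with matching column
    indices in col_idx.
    """
    y = [0.0] * n_cols
    for i in range(len(row_ptr) - 1):
        start, end = row_ptr[i], row_ptr[i + 1]
        # t = a_i^T x: dot product of row i with x.
        t = 0.0
        for k in range(start, end):
            t += vals[k] * x[col_idx[k]]
        # y += t * a_i: reuse the same row right away.
        for k in range(start, end):
            y[col_idx[k]] += t * vals[k]
    return y


# Usage: A = [[1, 0, 2], [0, 3, 0]] and x = [1, 1, 1].
row_ptr = [0, 2, 3]
col_idx = [0, 2, 1]
vals = [1.0, 2.0, 3.0]
print(ata_x(3, row_ptr, col_idx, vals, [1.0, 1.0, 1.0]))  # [3.0, 9.0, 6.0]
```

Compared with computing t = A x and then y = A^T t as two separate sparse matrix-vector products, this ordering reads the entries of A from memory once instead of twice, at the cost of irregular accesses to both x and y inside the same loop.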
