论文信息 - Memory Hierarchy Optimizations and Performance ounds for Sparse A

Memory Hierarchy Optimizations and Performance ounds for Sparse A

This paper presents uniprocessor performance optimizations, automatic tuning techniques, and an experimental analysis of the sparse matrix operation, y = AT Ax, where A is a sparse matrix and x, y are dense vectors. We describe an implementation of this computational kernel which brings A through the memory hierarchy only once, and which can be combined naturally with the register blocking optimization previously proposed in the Sparsity tuning system for sparse matrix-vector multiply. We evaluate these optimizations on a benchmark set of 44 matrices and 4 platforms, showing speedups of up to 4.2×. We also develop platform-specific upper-bounds on the performance of these implementations. We analyze how closely we can approach these bounds, and show when low-level tuning techniques (e.g., better instruction scheduling) are likely to yield a significant pay-off. Finally, we propose a hybrid off-line/run-time heuristic which in practice automatically selects nearoptimal values of the key tuning parameters, the register block sizes.

James Demmel | Richard W. Vuduc | Katherine A. Yelick | Attila Gyulassy

[1] Todd L. Veldhuizen,et al. Arrays in Blitz++ , 1998, ISCOPE.

[2] Siddhartha Chatterjee,et al. Exact analysis of the cache behavior of nested loops , 2001, PLDI '01.

[3] Roman Geus,et al. Towards a fast parallel sparse matrix-vector multiplication , 2000, PARCO.

[4] Dragan Mirkovic,et al. An adaptive software library for fast Fourier transforms , 2000, ICS '00.

[5] Sharad Malik,et al. Cache miss equations: a compiler framework for analyzing and tuning memory behavior , 1999, TOPL.

[6] Chau-Wen Tseng,et al. Improving data locality with loop transformations , 1996, TOPL.

[7] Olivier Temam,et al. Characterizing the behavior of sparse algorithms on caches , 1992, Proceedings Supercomputing '92.

[8] David E. Keyes,et al. Towards Realistic Performance Bounds for Implicit CFD Codes , 2000 .

[9] Monica S. Lam,et al. A data locality optimizing algorithm , 1991, PLDI '91.

[10] Roldan Pozo,et al. NIST sparse BLAS user's guide , 2001 .

[11] Eun Im,et al. Optimizing the Performance of Sparse Matrix-Vector Multiplication , 2000 .

[12] Rafael Hector Saavedra-Barrera,et al. CPU performance evaluation and execution time prediction using narrow spectrum benchmarking , 1992 .

[13] Stefan Andersson,et al. RS/6000 Scientific and Technical Computing: POWER3 Introduction and Tuning Guide , 1998 .

[14] Francisco F. Rivera,et al. Modeling and Improving Locality for Irregular Problems: Sparse Matrix-Vector Product on Cache Memories as a Cache Study , 1999, HPCN Europe.

[15] Emilio L. Zapata,et al. Memory Hierarchy Performance Prediction for Blocked Sparse Algorithms , 1999, Parallel Process. Lett..

[16] Jack J. Dongarra,et al. Automatically Tuned Linear Algebra Software , 1998, Proceedings of the IEEE/ACM SC98 Conference.

[17] Jack J. Dongarra,et al. Automated empirical optimizations of software and the ATLAS project , 2001, Parallel Comput..

[18] José M. F. Moura,et al. Fast Automatic Generation of DSP Algorithms , 2001, International Conference on Computational Science.

[19] Robert A. van de Geijn,et al. FLAME: Formal Linear Algebra Methods Environment , 2001, TOMS.

[20] William Kahan,et al. Document for the Basic Linear Algebra Subprograms (BLAS) standard: BLAS Technical Forum , 2001 .

[21] Katherine A. Yelick,et al. Optimizing Sparse Matrix Computations for Register Reuse in SPARSITY , 2001, International Conference on Computational Science.

[22] A. Pinar,et al. Improving Performance of Sparse Matrix-Vector Multiplication , 1999, ACM/IEEE SC 1999 Conference (SC'99).

[23] Weichung Wang,et al. Adaptive use of iterative methods in interior point methods for linear programming , 1995 .

[24] Katherine Yelick,et al. Automatic Performance Tuning and Analysis of Sparse Triangular Solve , 2002 .

[25] Steven G. Johnson,et al. FFTW: an adaptive software architecture for the FFT , 1998, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181).

[26] Jack J. Dongarra,et al. A Scalable Cross-Platform Infrastructure for Application Performance Tuning Using Hardware Counters , 2000, ACM/IEEE SC 2000 Conference (SC'00).

[27] James Demmel,et al. Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology , 1997, ICS '97.

[28] Michele Colajanni,et al. PSBLAS: a library for parallel linear algebra computation on sparse matrices , 2000, TOMS.

[29] James Demmel,et al. Performance Optimizations and Bounds for Sparse Matrix-Vector Multiply , 2002, ACM/IEEE SC 2002 Conference (SC'02).

[30] James Demmel,et al. Applied Numerical Linear Algebra , 1997 .

[31] Josep-Lluís Larriba-Pey,et al. Block algorithms for sparse matrix computations on high performance workstations , 1996, ICS '96.

[32] P. Sadayappan,et al. On improving the performance of sparse matrix-vector multiplication , 1997, Proceedings Fourth International Conference on High-Performance Computing.

[33] Jeremy G. Siek,et al. A Rational Approach to Portable High Performance: The Basic Linear Algebra Instruction Set (BLAIS) and the Fixed Algorithm Size Template (FAST) Library , 1998, ECOOP Workshops.

[34] Sathish S. Vadhiyar,et al. Towards an Accurate Model for Collective Communications , 2001, Int. J. High Perform. Comput. Appl..

[35] Sivan Toledo,et al. Improving the memory-system performance of sparse-matrix vector multiplication , 1997, IBM J. Res. Dev..

[36] William Gropp,et al. Efficient Management of Parallelism in Object-Oriented Numerical Software Libraries , 1997, SciTools.

[37] Aart J. C. Bik,et al. Automatic Nonzero Structure Analysis , 1999, SIAM J. Comput..

[38] Ken Kennedy,et al. Compiler blockability of numerical algorithms , 1992, Proceedings Supercomputing '92.

[39] Paul Vinson Stodghill,et al. A Relational Approach to the Automatic Generation of Sequential Sparse matrix Codes , 1997 .