Paper Information - Fast Sparse Matrix-Vector Multiplication by Exploiting Variable Block Structure

We improve the performance of sparse matrix-vector multiplication (SpMV) on modern cache-based superscalar machines when the matrix structure consists of multiple, irregularly aligned rectangular blocks. Matrices from finite element modeling applications often have this structure. We split the matrix, A, into a sum, A1 + A2 + ... + As, where each term is stored in a new data structure we refer to as unaligned block compressed sparse row (UBCSR) format. A classical approach that stores A in block compressed sparse row (BCSR) format can also reduce execution time, but the improvements may be limited because BCSR imposes an alignment on the matrix non-zeros that leads to extra work from filled-in zeros. Combining splitting with UBCSR reduces this extra work while retaining the generally lower memory bandwidth requirements and register-level tiling opportunities of BCSR. On a set of application matrices, we show speedups as high as 2.1× over no blocking and as high as 1.8× over BCSR as used in prior work. Even when performance does not improve significantly, split UBCSR usually reduces matrix storage.
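To make the fill-in argument concrete, the following is a minimal sketch rather than the paper's UBCSR implementation; the abstract does not specify the actual layout. It assumes a hypothetical storage scheme in which each term of the split is a list of `(row0, col0, dense_block)` entries whose starting positions may be arbitrary. The function names `aligned_fill_ratio` and `block_spmv` and the 6x6 example matrix are illustrative assumptions. The sketch computes y = A1 x + A2 x term by term and measures how many values an aligned r x c tiling would have to store per true non-zero.

```python
def aligned_fill_ratio(nonzeros, r, c):
    """Stored values per true non-zero when the pattern is tiled with r x c
    blocks whose top-left corners sit at multiples of r and c (the alignment
    that BCSR imposes). A ratio above 1.0 means explicit zeros are stored."""
    blocks = {(i // r, j // c) for i, j in nonzeros}
    return len(blocks) * r * c / len(nonzeros)


def block_spmv(blocks, x, y):
    """Accumulate y += A_k x for one term of the split.

    `blocks` is a list of (row0, col0, dense_block) entries; this layout is
    an assumption for illustration, and (row0, col0) need not be aligned."""
    for row0, col0, block in blocks:
        for di, block_row in enumerate(block):
            acc = 0.0
            for dj, value in enumerate(block_row):
                acc += value * x[col0 + dj]
            y[row0 + di] += acc
    return y


# Hypothetical 6x6 matrix: a 2x2 block starting at (1, 1) and a 3x3 block
# starting at (3, 3). Neither block starts on a multiple of 2.
term_2x2 = [(1, 1, [[1.0, 2.0], [3.0, 4.0]])]
term_3x3 = [(3, 3, [[5.0, 6.0, 7.0], [8.0, 9.0, 10.0], [11.0, 12.0, 13.0]])]

# Coordinates of every true non-zero in A = A1 + A2.
nonzeros = [(r0 + di, c0 + dj)
            for r0, c0, blk in term_2x2 + term_3x3
            for di, blk_row in enumerate(blk)
            for dj, _ in enumerate(blk_row)]

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [0.0] * 6
for term in (term_2x2, term_3x3):  # y = A1 x + A2 x
    block_spmv(term, x, y)

# Aligned 2x2 tiling touches 7 blocks, so 28 values are stored for 13 non-zeros.
fill_2x2 = aligned_fill_ratio(nonzeros, 2, 2)
```

In this toy case the split stores exactly the 13 non-zeros, whereas an aligned 2x2 tiling stores more than twice as many values. The inner loops over a small dense block are where the register-level tiling mentioned above would apply in a compiled kernel.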

Hyun Jin Moon | Richard W. Vuduc
