Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks

This paper introduces a storage format for sparse matrices, called <b><i>compressed sparse blocks (CSB)</i></b>, which allows both <i>Ax</i> and <i>A</i><sup>T</sup><i>x</i> to be computed efficiently in parallel, where <i>A</i> is an <i>n</i>×<i>n</i> sparse matrix with <i>nnz</i> ≥ <i>n</i> nonzeros and <i>x</i> is a dense <i>n</i>-vector. Our algorithms use Θ(<i>nnz</i>) work (serial running time) and Θ(√<i>n</i> lg <i>n</i>) span (critical-path length), yielding a parallelism of Θ(<i>nnz</i>/(√<i>n</i> lg <i>n</i>)), which is amply high for virtually any large matrix. The storage requirement for CSB is the same as that for the more-standard compressed-sparse-rows (CSR) format, for which computing <i>Ax</i> in parallel is easy but <i>A</i><sup>T</sup><i>x</i> is difficult. Benchmark results indicate that on one processor, the CSB algorithms for <i>Ax</i> and <i>A</i><sup>T</sup><i>x</i> run just as fast as the CSR algorithm for <i>Ax</i>, but the CSB algorithms also scale up linearly with processors until limited by off-chip memory bandwidth.
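
The contrast between CSR and a blocked layout can be illustrated with a minimal serial sketch. It is not the paper's implementation: the function names, the dictionary-of-blocks layout, and the block size `beta` (chosen here near √<i>n</i>) are assumptions made for illustration only. With CSR, each row of <i>Ax</i> writes a single output entry, whereas each row of <i>A</i><sup>T</sup><i>x</i> scatters into many output entries, so rows processed concurrently could collide. In the blocked layout, the two products differ only in which local index is read and which is written.

```python
def csr_matvec(n, row_ptr, col_ind, val, x):
    """y = A x with A in CSR. Row i writes only y[i], so rows are independent."""
    y = [0.0] * n
    for i in range(n):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += val[k] * x[col_ind[k]]
    return y


def csr_matvec_transpose(n, row_ptr, col_ind, val, x):
    """y = A^T x with A in CSR. Row i scatters into y[col_ind[k]], so two rows
    handled concurrently could update the same output entry."""
    y = [0.0] * n
    for i in range(n):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[col_ind[k]] += val[k] * x[i]
    return y


def to_blocks(beta, triples):
    """Group (i, j, value) triples into beta-by-beta blocks (hypothetical layout).
    Each block keeps offsets local to the block."""
    blocks = {}
    for i, j, v in triples:
        key = (i // beta, j // beta)
        blocks.setdefault(key, []).append((i % beta, j % beta, v))
    return blocks


def blocked_matvec(n, beta, blocks, x, transpose=False):
    """y = A x, or y = A^T x when transpose is True. The two cases are
    symmetric: only the roles of the row and column index are swapped."""
    y = [0.0] * n
    for (bi, bj), entries in blocks.items():
        for li, lj, v in entries:
            i, j = bi * beta + li, bj * beta + lj
            if transpose:
                y[j] += v * x[i]
            else:
                y[i] += v * x[j]
    return y


# A 4x4 example matrix:
# [[1, 0, 2, 0],
#  [0, 3, 0, 0],
#  [4, 0, 0, 5],
#  [0, 0, 6, 0]]
n = 4
triples = [(0, 0, 1.0), (0, 2, 2.0), (1, 1, 3.0),
           (2, 0, 4.0), (2, 3, 5.0), (3, 2, 6.0)]
row_ptr = [0, 2, 3, 5, 6]
col_ind = [0, 2, 1, 0, 3, 2]
val = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x = [1.0, 2.0, 3.0, 4.0]

beta = 2  # assumed block size, about sqrt(n)
blocks = to_blocks(beta, triples)

print(csr_matvec(n, row_ptr, col_ind, val, x))
print(blocked_matvec(n, beta, blocks, x))
print(csr_matvec_transpose(n, row_ptr, col_ind, val, x))
print(blocked_matvec(n, beta, blocks, x, transpose=True))
```

The sketch runs serially and only checks that the blocked layout reproduces the CSR results for both products. A parallel version would additionally have to decide how blocks that share output entries are scheduled, which this example does not attempt.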
