Generalised vectorisation for sparse matrix--vector multiplication

This work generalises the various ways in which sparse matrix--vector (SpMV) multiplication can be vectorised. It arrives at a novel data structure that generalises three well-known earlier data structures for sparse computations: the Blocked CRS format, the (sliced) ELLPACK format, and segmented-scan-based formats. The new data structure is relevant because efficient use of new hardware requires increasingly wide vector registers. Normally, the benefit of vectorisation for sparse computations is limited by memory bandwidth. When computations are limited by memory latency rather than memory bandwidth, however, vectorisation can still improve performance. The Intel Xeon Phi, a component of several top-500 supercomputers, displays exactly this behaviour for SpMV multiplication. On this architecture the new generalised vectorisation scheme increases performance by up to 178 percent.
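
To make the contrast between the storage formats named above concrete, the following is a minimal sketch of SpMV multiplication in plain CRS and in (unsliced) ELLPACK. It is not the data structure proposed in this work, and it does not cover Blocked CRS, sliced ELLPACK, or segmented-scan formats. All function and variable names (spmv_crs, crs_to_ellpack, spmv_ellpack, row_start, col_ind, and so on) are illustrative assumptions, and the padding convention (value 0.0, column 0) is one common choice among several.

```python
def spmv_crs(row_start, col_ind, values, x):
    """Compute y = A x with A stored in Compressed Row Storage (CRS)."""
    m = len(row_start) - 1
    y = [0.0] * m
    for i in range(m):
        # Inner trip count varies per row, which hinders vectorisation.
        for k in range(row_start[i], row_start[i + 1]):
            y[i] += values[k] * x[col_ind[k]]
    return y


def crs_to_ellpack(row_start, col_ind, values):
    """Pad every row to the longest row length (assumed padding: 0.0 at column 0).

    Returns (m, ell_col, ell_val), where ell_val[j][i] holds the j-th stored
    entry of row i, so that consecutive rows are contiguous in memory.
    """
    m = len(row_start) - 1
    width = max((row_start[i + 1] - row_start[i] for i in range(m)), default=0)
    ell_col = [[0] * m for _ in range(width)]
    ell_val = [[0.0] * m for _ in range(width)]
    for i in range(m):
        for j, k in enumerate(range(row_start[i], row_start[i + 1])):
            ell_col[j][i] = col_ind[k]
            ell_val[j][i] = values[k]
    return m, ell_col, ell_val


def spmv_ellpack(m, ell_col, ell_val, x):
    """Compute y = A x with A stored in ELLPACK."""
    y = [0.0] * m
    for cols, vals in zip(ell_col, ell_val):
        # Fixed trip count and independent iterations: each i could occupy
        # one SIMD lane, with x[cols[i]] fetched by a gather.
        for i in range(m):
            y[i] += vals[i] * x[cols[i]]
    return y


# Example: A = [[4, 0, 1], [0, 2, 0], [3, 0, 5]], x = [1, 2, 3].
row_start = [0, 2, 3, 5]
col_ind = [0, 2, 1, 0, 2]
values = [4.0, 1.0, 2.0, 3.0, 5.0]
x = [1.0, 2.0, 3.0]

y_crs = spmv_crs(row_start, col_ind, values, x)
y_ell = spmv_ellpack(*crs_to_ellpack(row_start, col_ind, values), x)
```

In this sketch ELLPACK trades extra storage (the padded zeros) for a regular loop structure, while CRS stores no padding but has row-dependent inner loop lengths. This trade-off is the general motivation behind sliced and hybrid variants.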
