Efficient sparse matrix multiple-vector multiplication using a bitmapped format

We consider the problem of obtaining high computational throughput from sparse matrix multiple-vector multiplication routines. Current sparse matrix formats and algorithms have high bandwidth requirements and make poor reuse of entries loaded into cache and registers, both of which restrict performance. We propose the mapped blocked row format: a bitmapped sparse matrix format that stores entries as blocks without fill overhead, thereby offering the benefits of blocking without additional storage or bandwidth costs. An efficient algorithm decodes the bitmaps using de Bruijn sequences and minimizes the number of conditionals evaluated. Performance is compared with that of popular formats, including vendor implementations of sparse BLAS. Our sparse matrix multiple-vector multiplication algorithm achieves high throughput on all platforms and is implemented using platform-neutral optimizations.
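The decoding idea can be illustrated with a minimal sketch. It is not the authors' implementation: the 4x4 block size, the row-major bit order, the packed `values` layout, and all function names are assumptions made for illustration. Only the general technique (isolating the lowest set bit and indexing it through a de Bruijn multiplication) follows the abstract.

```python
# Illustrative sketch only; block size, bit order, and names are assumptions.
DEBRUIJN32 = 0x077CB531  # a 32-bit de Bruijn sequence B(2, 5)
MASK32 = 0xFFFFFFFF

# Every 5-bit window of the sequence is unique, so the top 5 bits of
# (DEBRUIJN32 << i), truncated to 32 bits, identify the shift i.
LOOKUP = [0] * 32
for i in range(32):
    LOOKUP[((DEBRUIJN32 << i) & MASK32) >> 27] = i


def lowest_set_bit_index(v):
    """Index of the least significant 1 bit of a nonzero 32-bit word."""
    isolated = v & -v  # keeps only the lowest set bit, i.e. 2**i
    return LOOKUP[((isolated * DEBRUIJN32) & MASK32) >> 27]


def block_times_vectors(bitmap, values, X, Y, row0, col0, block=4):
    """Accumulate one bitmapped block into Y (Y += A_block * X).

    bitmap : bit (r * block + c) is set when entry (r, c) is nonzero
    values : the nonzero entries only, in increasing bit order (no fill)
    X, Y   : lists of rows; each row holds one value per vector
    """
    k = 0
    while bitmap:
        pos = lowest_set_bit_index(bitmap)
        r, c = divmod(pos, block)
        a = values[k]
        x_row = X[col0 + c]
        y_row = Y[row0 + r]
        for j in range(len(x_row)):  # reuse the loaded entry for every vector
            y_row[j] += a * x_row[j]
        bitmap &= bitmap - 1  # clear the bit just processed
        k += 1
    return Y
```

Under these assumptions, a 4x4 block with nonzeros at positions (0,0), (1,2), and (3,3) has the bitmap 0x8041 and a three-element `values` array, so no zero fill is stored. The loop visits set bits only, which is how a bitmap decoder can avoid testing every position in the block.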
