Paper information - Sparse matrix-vector multiply on the HICAMP architecture

Sparse matrix-vector multiply on the HICAMP architecture

Sparse matrix-vector multiply (SpMV) is a critical task in the inner loop of modern iterative linear system solvers and exhibits very little data reuse. This low reuse means that its performance is bounded by main-memory bandwidth. Moreover, the random patterns of indirection make it difficult to achieve this bound. We present sparse matrix storage formats based on deduplicated memory. These formats reduce memory traffic during SpMV and thus show significantly improved performance bounds: 90x better in the best case. Additionally, we introduce a matrix format that inherently exploits any amount of matrix symmetry and is at the same time fully compatible with non-symmetric matrix code. Because of this, our method can concurrently operate on a symmetric matrix without complicated work partitioning schemes and without any thread synchronization or locking. This approach takes advantage of growing processor caches, but incurs an instruction count overhead. It is feasible to overcome this issue by using specialized hardware as shown by the recently proposed Hierarchical Immutable Content-Addressable Memory Processor, or HICAMP architecture.
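
To make the low-reuse and indirection points concrete, here is a minimal sketch of SpMV over the conventional compressed sparse row (CSR) layout. This is the standard baseline format, not the deduplicated format the paper proposes; the function name `spmv_csr` and the 3x3 example matrix are illustrative assumptions. Each stored value and column index is read exactly once, and the access `x[col_idx[k]]` is the indirect load whose pattern depends on the matrix structure.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A stored in CSR form (illustrative baseline)."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        # Nonzeros of row i occupy positions row_ptr[i] .. row_ptr[i+1]-1.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # values[k] and col_idx[k] are each touched once (no reuse);
            # x[col_idx[k]] is an indirect, data-dependent access.
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y


# Hypothetical example matrix (symmetric):
# A = [[4, 0, 1],
#      [0, 3, 0],
#      [1, 0, 2]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [4.0, 1.0, 3.0, 1.0, 2.0]
x = [1.0, 2.0, 3.0]

y = spmv_csr(row_ptr, col_idx, values, x)
print(y)  # [7.0, 6.0, 7.0]
```

Note that the symmetric entry `1.0` is stored twice here (positions 1 and 3 of `values`). Plain CSR does not exploit that redundancy, which is the kind of repeated content a deduplicated storage format can avoid fetching more than once.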

Mark Horowitz | David R. Cheriton | John P. Stevenson | Amin Firoozshahian | Alex Solomatnikov

[1] Nectarios Koziris, et al. CSX: an extended compression format for spmv on shared memory systems, 2011, PPoPP '11.

[2] John R. Gilbert, et al. Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks, 2009, SPAA '09.

[3] Sivan Toledo, et al. Improving the memory-system performance of sparse-matrix vector multiplication, 1997, IBM J. Res. Dev.

[4] Marcin Paprzycki, et al. Use of hybrid recursive CSR/COO data structures in sparse matrix-vector multiplication, 2010, Proceedings of the International Multiconference on Computer Science and Information Technology.

[5] Samuel Williams, et al. Reduced-Bandwidth Multithreaded Algorithms for Sparse Matrix-Vector Multiplication, 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.

[6] David R. Cheriton. Hierarchical immutable content-addressable memory processor, 2008.

[7] Marcin Dabrowski, et al. Parallel symmetric sparse matrix-vector product on scalar multi-core CPUs, 2010, Parallel Comput.

[8] Nectarios Koziris, et al. Improving the Performance of Multithreaded Sparse Matrix-Vector Multiplication Using Index and Value Compression, 2008, 2008 37th International Conference on Parallel Processing.

[9] Jason D. Bakos, et al. A Sparse Matrix Personality for the Convey HC-1, 2011, 2011 IEEE 19th Annual International Symposium on Field-Programmable Custom Computing Machines.

[10] Michael Garland, et al. Implementing sparse matrix-vector multiplication on throughput-oriented processors, 2009, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis.

[11] Calvin J. Ribbens, et al. Pattern-based sparse matrix representation for memory-efficient SMVM kernels, 2009, ICS.

[12] David R. Cheriton, et al. HICAMP: architectural support for efficient concurrency-safe shared structured data access, 2012, ASPLOS XVII.

[13] Andrew Lumsdaine, et al. Accelerating sparse matrix computations via data compression, 2006, ICS '06.

[14] Samuel Williams, et al. Optimization of sparse matrix-vector multiplication on emerging multicore platforms, 2009, Parallel Comput.

[15] Katherine Yelick, et al. OSKI: A library of automatically tuned sparse matrix kernels, 2005.

[16] Gerhard Wellein, et al. LIKWID: A Lightweight Performance-Oriented Tool Suite for x86 Multicore Environments, 2010, 2010 39th International Conference on Parallel Processing Workshops.

[17] Guy E. Blelloch, et al. Hierarchical Diagonal Blocking and Precision Reduction Applied to Combinatorial Multigrid, 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.

[18] Nectarios Koziris, et al. Optimizing sparse matrix-vector multiplication using index and value compression, 2008, CF '08.

[19] James Demmel, et al. Performance Optimizations and Bounds for Sparse Matrix-Vector Multiply, 2002, ACM/IEEE SC 2002 Conference (SC'02).

[20] Andrew A. Chien, et al. The future of microprocessors, 2011, Commun. ACM.

[21] Ricardo E. Gonzalez, et al. Xtensa: A Configurable and Extensible Processor, 2000, IEEE Micro.

[22] Gerhard Wellein, et al. LIKWID: Lightweight Performance Tools, 2011, CHPC.

[23] Timothy A. Davis, et al. The University of Florida sparse matrix collection, 2011, TOMS.

[24] Richard W. Vuduc, et al. Sparsity: Optimization Framework for Sparse Matrix Kernels, 2004, Int. J. High Perform. Comput. Appl.