Optimization of sparse matrix-vector multiplication on emerging multicore platforms

We are witnessing a dramatic change in computer architecture due to the multicore paradigm shift, as every electronic device from cell phones to supercomputers confronts parallelism of unprecedented scale. To fully unleash the potential of these systems, the HPC community must develop multicore specific-optimization methodologies for important scientific computations. In this work, we examine sparse matrix-vector multiply (SpMV) - one of the most heavily used kernels in scientific computing - across a broad spectrum of multicore designs. Our experimental platform includes the homogeneous AMD quad-core, AMD dual-core, and Intel quad-core designs, the heterogeneous STI Cell, as well as one of the first scientific studies of the highly multithreaded Sun Victoria Falls (a Niagara2 SMP). We present several optimization strategies especially effective for the multicore environment, and demonstrate significant performance improvements compared to existing state-of-the-art serial and parallel SpMV implementations. Additionally, we present key insights into the architectural trade-offs of leading multicore design strategies, in the context of demanding memory-bound numerical algorithms.

[1]  Nectarios Koziris,et al.  Optimizing sparse matrix-vector multiplication using index and value compression , 2008, CF '08.

[2]  Richard W. Vuduc,et al.  Sparsity: Optimization Framework for Sparse Matrix Kernels , 2004, Int. J. High Perform. Comput. Appl..

[3]  Michael Gschwind Chip multiprocessing and the cell broadband engine , 2006, CF '06.

[4]  A. Pinar,et al.  Improving Performance of Sparse Matrix-Vector Multiplication , 1999, ACM/IEEE SC 1999 Conference (SC'99).

[5]  Brendan Vastenhouw,et al.  A Two-Dimensional Data Distribution Method for Parallel Sparse Matrix-Vector Multiplication , 2005, SIAM Rev..

[6]  Roman Geus,et al.  Towards a fast parallel sparse matrix-vector multiplication , 2000, PARCO.

[7]  Katherine Yelick,et al.  OSKI: A library of automatically tuned sparse matrix kernels , 2005 .

[8]  Olivier Temam,et al.  Characterizing the behavior of sparse algorithms on caches , 1992, Proceedings Supercomputing '92.

[9]  James Demmel,et al.  Memory Hierarchy Optimizations and Performance ounds for Sparse A , 2003, International Conference on Computational Science.

[10]  Samuel Williams,et al.  Optimization of sparse matrix-vector multiplication on emerging multicore platforms , 2007, Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07).

[11]  Martin Hopkins,et al.  Synergistic Processing in Cell's Multicore Architecture , 2006, IEEE Micro.

[12]  D. Rose A GRAPH-THEORETIC STUDY OF THE NUMERICAL SOLUTION OF SPARSE POSITIVE DEFINITE SYSTEMS OF LINEAR EQUATIONS , 1972 .

[13]  Shekhar Y. Borkar,et al.  Design challenges of technology scaling , 1999, IEEE Micro.

[14]  Andrew Lumsdaine,et al.  Accelerating sparse matrix computations via data compression , 2006, ICS '06.

[15]  Guy E. Blelloch,et al.  AD-A 270 601 Segmented Operations for Sparse Matrix Computation on Vector Multiprocessors , 1993 .

[16]  Samuel Williams,et al.  Scientific computing Kernels on the cell processor , 2007 .

[17]  Roman Geus,et al.  Towards a fast parallel sparse symmetric matrix-vector multiplication , 2001, Parallel Comput..

[18]  Richard Vuduc,et al.  Automatic performance tuning of sparse matrix kernels , 2003 .

[19]  P. Sadayappan,et al.  On improving the performance of sparse matrix-vector multiplication , 1997, Proceedings Fourth International Conference on High-Performance Computing.

[20]  Samuel Williams,et al.  The Landscape of Parallel Computing Research: A View from Berkeley , 2006 .

[21]  James Demmel,et al.  When cache blocking of sparse matrix vector multiply works and why , 2007, Applicable Algebra in Engineering, Communication and Computing.

[22]  David A. Patterson,et al.  Computer Architecture, Fifth Edition: A Quantitative Approach , 2011 .

[23]  David A. Patterson,et al.  Computer Architecture: A Quantitative Approach , 1969 .

[24]  William Gropp,et al.  Efficient Management of Parallelism in Object-Oriented Numerical Software Libraries , 1997, SciTools.

[25]  Sivan Toledo,et al.  Improving the memory-system performance of sparse-matrix vector multiplication , 1997, IBM J. Res. Dev..

[26]  Katherine Yelick,et al.  Performance models for evaluation and automatic tuning of symmetric sparse matrix-vector multiply , 2004 .

[27]  Larry Carter,et al.  Sparse Tiling for Stationary Iterative Methods , 2004, Int. J. High Perform. Comput. Appl..

[28]  Katherine Yelick,et al.  Automatic Performance Tuning and Analysis of Sparse Triangular Solve , 2002 .