Lawrence Berkeley National Laboratory

Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms

We are witnessing a dramatic change in computer architecture due to the multicore paradigm shift, as every electronic device from cell phones to supercomputers confronts parallelism of unprecedented scale. To fully unleash the potential of these systems, the HPC community must develop multicore-specific optimization methodologies for important scientific computations. In this work, we examine sparse matrix-vector multiply (SpMV) – one of the most heavily used kernels in scientific computing – across a broad spectrum of multicore designs. Our experimental platforms include the homogeneous AMD quad-core, AMD dual-core, and Intel quad-core designs, the heterogeneous STI Cell, and the highly multithreaded Sun Victoria Falls (a Niagara2 SMP), for which this is one of the first scientific studies. We present several optimization strategies that are especially effective in the multicore environment, and demonstrate significant performance improvements over existing state-of-the-art serial and parallel SpMV implementations. Additionally, we present key insights into the architectural trade-offs of leading multicore design strategies in the context of demanding memory-bound numerical algorithms.
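For readers unfamiliar with the kernel, the following is a minimal, unoptimized sketch of SpMV, y = Ax. It assumes the matrix is stored in compressed sparse row (CSR) form, a common choice that the abstract itself does not specify. The function name `spmv_csr` and the array names `row_ptr`, `col_idx`, and `values` are illustrative only and are not taken from the paper.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A stored in CSR form (illustrative sketch).

    row_ptr[i]..row_ptr[i+1] delimits the nonzeros of row i;
    col_idx[k] and values[k] give the column and value of nonzero k.
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        # Each nonzero costs one multiply and one add, but requires loading
        # a value, a column index, and an indirectly addressed entry of x.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y


# Small example matrix (assumed for illustration):
# A = [[4, 0, 1],
#      [0, 2, 0],
#      [3, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [4.0, 1.0, 2.0, 3.0, 5.0]
x = [1.0, 2.0, 3.0]

y = spmv_csr(row_ptr, col_idx, values, x)
print(y)  # [7.0, 4.0, 18.0]
```

The inner loop performs only two floating-point operations per nonzero while reading at least three memory locations, which suggests why the kernel is usually limited by memory traffic and not by arithmetic throughput.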
