Optimizing sparse matrix vector multiplication on emerging multicores

After hitting the power wall, the dramatic change in computer architecture from single core to multicore/manycore brings us new challenges on high performance computing, especially for the data intensive applications. Sparse matrix-vector multiplication (SpMV) is one of the most important computations in this area, and has therefore received a lot of attention in recent decades. In contrast to the uniform/regular dense matrix computations, SpMV's irregular data access patterns with compact data structure for storage make the SpMV optimization more complex than optimizing regular/dense matrix computation. In this work, we look at the SpMV optimization problem in the context of emerging multicores from a different architecture conscious perspective, and propose an optimization strategy that has three key components: mapping, scheduling and data layout reorganization. Specifically, the mapping component derives a suitable iteration-to-core mapping; the scheduling component determines the execution order of loop iterations assigned to each core in the target multicore architecture; and finally, the data layout reorganization component prepares multiple memory layouts for the source (input) vector customized for different row patterns. A distinguishing characteristic of our approach is that it is cache hierarchy aware, that is, all three components take the underlying cache hierarchy of the target multicore architecture into account, and therefore, the derived solution is, in a sense, customized to the target architecture. We evaluate the proposed strategy using 10 sparse matrices with two different multicore systems. Our experimental evaluation reveals that the proposed optimization algorithm brings significant performance improvements (up to 26.5%) over the unoptimized case.

[1]  Rob H. Bisseling,et al.  Cache-Oblivious Sparse Matrix--Vector Multiplication by Using Sparse Matrix Partitioning Methods , 2009, SIAM J. Sci. Comput..

[2]  Gerhard Wellein,et al.  Hybrid-Parallel Sparse Matrix-Vector Multiplication with Explicit Communication Overlap on Current Multicore-Based Systems , 2011, Parallel Process. Lett..

[3]  Brendan Vastenhouw,et al.  A Two-Dimensional Data Distribution Method for Parallel Sparse Matrix-Vector Multiplication , 2005, SIAM Rev..

[4]  Sivan Toledo,et al.  Improving the memory-system performance of sparse-matrix vector multiplication , 1997, IBM J. Res. Dev..

[5]  Nectarios Koziris,et al.  Performance evaluation of the sparse matrix-vector multiplication on modern architectures , 2009, The Journal of Supercomputing.

[6]  Rudolf Eigenmann,et al.  Adaptive runtime tuning of parallel sparse matrix-vector multiplication on distributed memory systems , 2008, ICS '08.

[7]  Mahmut T. Kandemir,et al.  On-chip cache hierarchy-aware tile scheduling for multicore machines , 2011, International Symposium on Code Generation and Optimization (CGO 2011).

[8]  A. Pinar,et al.  Improving Performance of Sparse Matrix-Vector Multiplication , 1999, ACM/IEEE SC 1999 Conference (SC'99).

[9]  Marcin Dabrowski,et al.  Parallel symmetric sparse matrix-vector product on scalar multi-core CPUs , 2010, Parallel Comput..

[10]  Andrew Lumsdaine,et al.  Accelerating sparse matrix computations via data compression , 2006, ICS '06.

[11]  Arutyun Avetisyan,et al.  Automatically Tuning Sparse Matrix-Vector Multiplication for GPU Architectures , 2010, HiPEAC.

[12]  John R. Gilbert,et al.  Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks , 2009, SPAA '09.

[13]  Nectarios Koziris,et al.  Optimizing sparse matrix-vector multiplication using index and value compression , 2008, CF '08.

[14]  Eun Im,et al.  Optimizing the Performance of Sparse Matrix-Vector Multiplication , 2000 .

[15]  Hyun Jin Moon,et al.  Fast Sparse Matrix-Vector Multiplication by Exploiting Variable Block Structure , 2005, HPCC.

[16]  James Demmel,et al.  Performance Optimizations and Bounds for Sparse Matrix-Vector Multiply , 2002, ACM/IEEE SC 2002 Conference (SC'02).

[17]  Eitan Grinspun,et al.  Sparse matrix solvers on the GPU: conjugate gradients and multigrid , 2003, SIGGRAPH Courses.

[18]  Timothy A. Davis,et al.  The university of Florida sparse matrix collection , 2011, TOMS.

[19]  AykanatCevdet,et al.  Hypergraph-Partitioning-Based Decomposition for Parallel Sparse-Matrix Vector Multiplication , 1999 .

[20]  Calvin J. Ribbens,et al.  Pattern-based sparse matrix representation for memory-efficient SMVM kernels , 2009, ICS.

[21]  Georg Hager,et al.  Performance limitations for sparse matrix-vector multiplications on current multicore environments , 2009, ArXiv.

[22]  Samuel Williams,et al.  Optimization of sparse matrix-vector multiplication on emerging multicore platforms , 2007, Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07).

[23]  Mahmut T. Kandemir,et al.  Locality-aware mapping and scheduling for multicores , 2013, Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).

[24]  Yunquan Zhang,et al.  Performance Evaluation of Multithreaded Sparse Matrix-Vector Multiplication Using OpenMP , 2009, 2009 11th IEEE International Conference on High Performance Computing and Communications.

[25]  Ramesh C. Agarwal,et al.  A high performance algorithm using pre-processing for the sparse matrix-vector multiplication , 1992, Proceedings Supercomputing '92.

[26]  Mahmut T. Kandemir,et al.  Cache topology aware computation mapping for multicores , 2010, PLDI '10.

[27]  Michael M. Wolf,et al.  Optimizing Parallel Sparse Matrix-Vector Multiplication by Corner Partitioning , 2008 .

[28]  Rajesh Bordawekar,et al.  Optimizing Sparse Matrix-Vector Multiplication on GPUs using Compile-time and Run-time Strategies , 2008 .

[29]  Michael Garland,et al.  Implementing sparse matrix-vector multiplication on throughput-oriented processors , 2009, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis.

[30]  Emilio L. Zapata,et al.  Data Distributions for Sparse Matrix Vector Multiplication , 1995, Parallel Comput..

[31]  Reiji Suda,et al.  Performance Evaluation of Parallel Sparse Matrix-Vector Products on SGI Altix3700 , 2005, IWOMP.

[32]  Kiran Kumar Matam,et al.  Accelerating Sparse Matrix Vector Multiplication in Iterative Methods Using GPU , 2011, 2011 International Conference on Parallel Processing.

[33]  Mahmut T. Kandemir,et al.  Optimizing Data Layouts for Parallel Computation on Multicores , 2011, 2011 International Conference on Parallel Architectures and Compilation Techniques.

[34]  P. Sadayappan,et al.  On improving the performance of sparse matrix-vector multiplication , 1997, Proceedings Fourth International Conference on High-Performance Computing.