Parallel and Scalable Sparse Basic Linear Algebra Subprograms

Sparse basic linear algebra subprograms (BLAS) are fundamental building blocks for numerous scientific computations and graph applications. Compared with dense BLAS, parallelizing sparse BLAS routines entails extra challenges due to the irregularity of sparse data structures. This thesis proposes new fundamental algorithms and data structures that accelerate sparse BLAS routines on modern massively parallel processors: (1) a new heap data structure named ad-heap, for faster heap operations on heterogeneous processors; (2) a new sparse matrix representation named CSR5, for faster sparse matrix-vector multiplication (SpMV) on homogeneous processors such as CPUs, GPUs, and Xeon Phi; (3) a new CSR-based SpMV algorithm for a variety of tightly coupled CPU-GPU heterogeneous processors; and (4) a new framework and associated algorithms for sparse matrix-matrix multiplication (SpGEMM) on GPUs and heterogeneous processors. The thesis compares the proposed methods with state-of-the-art approaches on six homogeneous and five heterogeneous processors from Intel, AMD, and NVIDIA. Using a benchmark suite of 38 sparse matrices in total, the experimental results show that the proposed methods achieve significant performance improvements over the best existing algorithms.
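To make the baseline concrete, the sketch below shows a serial SpMV kernel over the standard CSR (Compressed Sparse Row) format that CSR5 and the CSR-based heterogeneous algorithm build on. The function name and the tiny example matrix are illustrative, not from the thesis; the per-row loop structure is the part whose irregular work distribution motivates the proposed methods.

```python
def csr_spmv(row_ptr, col_idx, vals, x):
    """Compute y = A @ x for a matrix A stored in CSR format.

    row_ptr[i]..row_ptr[i+1] delimit the nonzeros of row i inside
    the flat arrays vals (values) and col_idx (column indices).
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        # Rows may hold very different nonzero counts, which is the
        # source of load imbalance on massively parallel hardware.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += vals[k] * x[col_idx[k]]
        y[i] = acc
    return y

# 3x3 example:
# A = [[1, 0, 2],
#      [0, 3, 0],
#      [4, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
vals = [1.0, 2.0, 3.0, 4.0, 5.0]
x = [1.0, 1.0, 1.0]
print(csr_spmv(row_ptr, col_idx, vals, x))  # → [3.0, 3.0, 9.0]
```

Parallelizing this kernel by assigning one row per thread suffers when row lengths vary widely; CSR5 instead partitions the nonzeros into equally sized 2D tiles so that work, rather than rows, is divided evenly.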
