Sparse Matrix-Vector Multiplication on GPU

Sparse matrix-vector multiplication (SpMV) is one of the key operations in linear algebra. The main challenges in optimizing SpMV on GPUs are thread divergence, load imbalance, and uncoalesced, indirect memory accesses, all of which stem from the sparsity and irregularity of the input matrix. This dissertation develops solutions that address these challenges effectively. The first part of this dissertation focuses on a new blocked row-column (BRC) storage format with a two-dimensional blocking mechanism. It reduces thread divergence by reordering the rows of the input matrix and blocking rows with nearly equal numbers of non-zero elements onto the same execution units (i.e., warps). BRC improves load balance by partitioning rows into blocks with a constant number of non-zeros, so that different warps perform the same amount of work. We also present an approach to optimize BRC performance by judicious selection of the block size based on the sparsity characteristics of the matrix.

The most commonly used format for a sparse matrix is CSR (Compressed Sparse Row), but a number of other representations have recently been developed that achieve higher SpMV performance. However, the alternative representations typically impose a significant preprocessing overhead. While a high preprocessing overhead can be amortized for applications requiring many iterative invocations
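As a point of reference for the CSR format and the row-reordering idea described above, the following is a minimal sequential sketch in Python rather than a GPU kernel. It is not the dissertation's BRC implementation: the function names (`csr_spmv`, `row_lengths`, `padded_work`), the toy 4x4 matrix, and the group size of 2 (standing in for a 32-thread warp) are all assumptions made for illustration. It shows how CSR stores a matrix, why the access to `x` is indirect, and how grouping rows of similar length can reduce the idle slots that arise when every row in a group must wait for the longest one.

```python
from typing import List, Sequence


def csr_spmv(row_ptr: Sequence[int], col_idx: Sequence[int],
             values: Sequence[float], x: Sequence[float]) -> List[float]:
    """Compute y = A @ x for a matrix A stored in CSR form."""
    num_rows = len(row_ptr) - 1
    y = [0.0] * num_rows
    for row in range(num_rows):
        acc = 0.0
        # Non-zeros of this row live in values[row_ptr[row]:row_ptr[row + 1]].
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]  # indirect access into x
        y[row] = acc
    return y


def row_lengths(row_ptr: Sequence[int]) -> List[int]:
    """Number of non-zeros in each row."""
    return [row_ptr[i + 1] - row_ptr[i] for i in range(len(row_ptr) - 1)]


def padded_work(lengths: Sequence[int], group_size: int) -> int:
    """Slots consumed when each group of consecutive rows runs in lock-step
    for as long as its longest row (a rough model of warp divergence)."""
    total = 0
    for start in range(0, len(lengths), group_size):
        group = lengths[start:start + group_size]
        total += max(group) * len(group)
    return total


# Toy matrix (hypothetical example data):
# [[1, 0, 2, 0],
#  [0, 3, 0, 0],
#  [4, 5, 6, 7],
#  [0, 0, 0, 8]]
row_ptr = [0, 2, 3, 7, 8]
col_idx = [0, 2, 1, 0, 1, 2, 3, 3]
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x = [1.0, 1.0, 1.0, 1.0]

y = csr_spmv(row_ptr, col_idx, values, x)

lengths = row_lengths(row_ptr)
work_original = padded_work(lengths, group_size=2)
# Sorting rows by length places rows with similar non-zero counts together.
work_sorted = padded_work(sorted(lengths, reverse=True), group_size=2)
```

With 8 actual non-zeros, the original row order costs 12 slots under this model and the sorted order costs 10, which hints at why placing rows of similar length on the same warp helps. A real format must also record the row permutation so that results can be written back to the correct positions in `y`; that bookkeeping is omitted here.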
