Adaptive sparse tiling for sparse matrix multiplication

Tiling is a key technique for data locality optimization and is widely used in high-performance implementations of dense matrix-matrix multiplication for multicore/manycore CPUs and GPUs. However, the irregular and matrix-dependent data access pattern of sparse matrix multiplication makes it challenging to use tiling to enhance data reuse. In this paper, we devise an adaptive tiling strategy and apply it to enhance the performance of two primitives: SpMM (the product of a sparse matrix and a dense matrix) and SDDMM (sampled dense-dense matrix multiplication). In contrast to studies that have resorted to non-standard sparse-matrix representations to enhance performance, we use the standard Compressed Sparse Row (CSR) representation, within which intra-row reordering is performed to enable adaptive tiling. Experimental evaluation using an extensive set of matrices from the SuiteSparse matrix collection demonstrates significant performance improvement over currently available state-of-the-art alternatives.
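To make the two primitives concrete, the following is a minimal reference sketch (not the paper's tiled implementation) of SpMM and SDDMM over the standard CSR representation. Function names and the NumPy-based formulation are illustrative assumptions; a tuned implementation would tile these loops for locality.

```python
import numpy as np

def spmm_csr(indptr, indices, data, B):
    """SpMM: C = A @ B, where A is an M x K sparse matrix in CSR form
    (indptr, indices, data) and B is a dense K x N matrix.
    Illustrative reference loop nest, not an optimized kernel."""
    M = len(indptr) - 1
    C = np.zeros((M, B.shape[1]))
    for i in range(M):
        # Accumulate the rows of B selected by the nonzeros of row i of A.
        for p in range(indptr[i], indptr[i + 1]):
            C[i, :] += data[p] * B[indices[p], :]
    return C

def sddmm_csr(indptr, indices, data, D1, D2):
    """SDDMM: for each nonzero (i, j) of the CSR sampling matrix S,
    compute S[i,j] * (D1 @ D2)[i,j], where D1 is M x K and D2 is K x N.
    Only entries at the sparsity pattern of S are computed."""
    out = np.empty_like(data, dtype=np.float64)
    M = len(indptr) - 1
    for i in range(M):
        for p in range(indptr[i], indptr[i + 1]):
            out[p] = data[p] * np.dot(D1[i, :], D2[:, indices[p]])
    return out
```

The key locality challenge both kernels share is visible here: the access `B[indices[p], :]` (and `D2[:, indices[p]]`) is indirect and matrix-dependent, which is what motivates an adaptive, per-matrix tiling strategy.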
