A framework for general sparse matrix-matrix multiplication on GPUs and heterogeneous processors

General sparse matrix-matrix multiplication (SpGEMM) is a fundamental building block for numerous applications such as algebraic multigrid method (AMG), breadth first search and shortest path problem. Compared to other sparse BLAS routines, an efficient parallel SpGEMM implementation has to handle extra irregularity from three aspects: (1) the number of nonzero entries in the resulting sparse matrix is unknown in advance, (2) very expensive parallel insert operations at random positions in the resulting sparse matrix dominate the execution time, and (3) load balancing must account for sparse data in both input matrices.In this work we propose a framework for SpGEMM on GPUs and emerging CPU-GPU heterogeneous processors. This framework particularly focuses on the above three problems. Memory pre-allocation for the resulting matrix is organized by a hybrid method that saves a large amount of global memory space and efficiently utilizes the very limited on-chip scratchpad memory. Parallel insert operations of the nonzero entries are implemented through the GPU merge path algorithm that is experimentally found to be the fastest GPU merge approach. Load balancing builds on the number of necessary arithmetic operations on the nonzero entries and is guaranteed in all stages.Compared with the state-of-the-art CPU and GPU SpGEMM methods, our approach delivers excellent absolute performance and relative speedups on various benchmarks multiplying matrices with diverse sparsity structures. Furthermore, on heterogeneous processors, our SpGEMM approach achieves higher throughput by using re-allocatable shared virtual memory. We design a framework for SpGEMM on modern manycore processors using the CSR format.We present a hybrid method for pre-allocating the resulting sparse matrix.We propose an efficient parallel insert method for long rows of the resulting matrix.We develop a heuristic-based load balancing strategy.Our approach significantly outperforms other known CPU and GPU SpGEMM methods.

[1]  Toshio Nakatani,et al.  AA-Sort: A New Parallel Sorting Algorithm for Multi-Core SIMD Processors , 2007, 16th International Conference on Parallel Architecture and Compilation Techniques (PACT 2007).

[2]  Hagen Peters,et al.  Comparison-Based In-Place Sorting with CUDA , 2012 .

[3]  John R. Gilbert,et al.  High-Performance Graph Algorithms from Parallel Sparse Matrices , 2006, PARA.

[4]  Uwe Naumann,et al.  GPU-Accelerated Sparse Matrix-Matrix Multiplication by Iterative Row Merging , 2015, SIAM J. Sci. Comput..

[5]  Michael Garland,et al.  Implementing sparse matrix-vector multiplication on throughput-oriented processors , 2009, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis.

[6]  Jie Shen,et al.  Improving performance by matching imbalanced workloads with heterogeneous platforms , 2014, ICS '14.

[7]  Michael Garland,et al.  Designing efficient sorting algorithms for manycore GPUs , 2009, 2009 IEEE International Symposium on Parallel & Distributed Processing.

[8]  Aaftab Munshi,et al.  The OpenCL specification , 2009, 2009 IEEE Hot Chips 21 Symposium (HCS).

[9]  Ümit V. Çatalyürek,et al.  Performance Evaluation of Sparse Matrix Multiplication Kernels on Intel Xeon Phi , 2013, PPAM.

[10]  Brian Vinter,et al.  CSR5: An Efficient Storage Format for Cross-Platform Sparse Matrix-Vector Multiplication , 2015, ICS.

[11]  Song Yu,et al.  SPARSE MATRIX-VECTOR MULTIPLICATION ON NVIDIA GPU , 2012 .

[12]  Francisco Vázquez,et al.  Fast Sparse Matrix Matrix Product Based on ELLR-T and GPU Computing , 2012, 2012 IEEE 10th International Symposium on Parallel and Distributed Processing with Applications.

[13]  Raphael Yuster,et al.  Fast sparse matrix multiplication , 2004, TALG.

[14]  Samuel Williams,et al.  Reduced-Bandwidth Multithreaded Algorithms for Sparse Matrix-Vector Multiplication , 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.

[15]  DemmelJames,et al.  Optimization of sparse matrix-vector multiplication on emerging multicore platforms , 2009 .

[16]  John R. Gilbert,et al.  Challenges and Advances in Parallel Sparse Matrix-Matrix Multiplication , 2008, 2008 37th International Conference on Parallel Processing.

[17]  James Demmel,et al.  Communication optimal parallel multiplication of sparse random matrices , 2013, SPAA.

[18]  S. Lund,et al.  Bohrium: A Virtual Machine Approach to Portable Parallelism , 2014, 2014 IEEE International Parallel & Distributed Processing Symposium Workshops.

[19]  Brian Vinter,et al.  Speculative segmented sum for sparse matrix-vector multiplication on heterogeneous processors , 2015, Parallel Comput..

[20]  John R. Gilbert,et al.  On the representation and multiplication of hypersparse matrices , 2008, 2008 IEEE International Symposium on Parallel and Distributed Processing.

[21]  Norbert Luttenberger,et al.  A Novel Sorting Algorithm for Many-core Architectures Based on Adaptive Bitonic Sort , 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium.

[22]  Kiran Kumar Matam,et al.  Sparse matrix-matrix multiplication on modern architectures , 2012, 2012 19th International Conference on High Performance Computing.

[23]  Jianbin Fang,et al.  A Comprehensive Performance Comparison of CUDA and OpenCL , 2011, 2011 International Conference on Parallel Processing.

[24]  Matt Pharr,et al.  Gpu gems 2: programming techniques for high-performance graphics and general-purpose computation , 2005 .

[25]  Rasmus Pagh,et al.  Better Size Estimation for Sparse Matrix Products , 2010, Algorithmica.

[26]  Fred G. Gustavson,et al.  Two Fast Algorithms for Sparse Matrices: Multiplication and Permuted Transposition , 1978, TOMS.

[27]  George Varghese,et al.  A 22nm IA multi-CPU and GPU System-on-Chip , 2012, 2012 IEEE International Solid-State Circuits Conference.

[28]  Luke N. Olson,et al.  Exposing Fine-Grained Parallelism in Algebraic Multigrid Methods , 2012, SIAM J. Sci. Comput..

[29]  David A. Bader,et al.  GPU merge path: a GPU merging algorithm , 2012, ICS '12.

[30]  William J. Dally,et al.  GPUs and the Future of Parallel Computing , 2011, IEEE Micro.

[31]  Haim Kaplan,et al.  Colored intersection searching via sparse rectangular matrix multiplication , 2006, SCG '06.

[32]  Andrew A. Davidson,et al.  Efficient parallel merge sort for fixed and variable length keys , 2012, 2012 Innovative Parallel Computing (InPar).

[33]  Kim M. Hazelwood,et al.  Where is the data? Why you cannot debate CPU vs. GPU performance without the answer , 2011, (IEEE ISPASS) IEEE INTERNATIONAL SYMPOSIUM ON PERFORMANCE ANALYSIS OF SYSTEMS AND SOFTWARE.

[34]  Jie Shen,et al.  An application-centric evaluation of OpenCL on multi-core CPUs , 2013, Parallel Comput..

[35]  John R. Gilbert,et al.  Parallel Sparse Matrix-Matrix Multiplication and Indexing: Implementation and Experiments , 2011, SIAM J. Sci. Comput..

[36]  Kanad Ghose,et al.  Caching-efficient multithreaded fast multiplication of sparse matrices , 1998, Proceedings of the First Merged International Parallel Processing Symposium and Symposium on Parallel and Distributed Processing.

[37]  Luke N. Olson,et al.  Optimizing Sparse Matrix—Matrix Multiplication for the GPU , 2015, ACM Trans. Math. Softw..

[38]  John R. Gilbert,et al.  Sparse Matrices in MATLAB: Design and Implementation , 1992, SIAM J. Matrix Anal. Appl..

[39]  Maurice Steinman,et al.  AMD Fusion APU: Llano , 2012, IEEE Micro.

[40]  Francisco Vázquez,et al.  FastSpMM: An Efficient Library for Sparse Matrix Matrix Product on GPUs , 2014, Comput. J..

[41]  Raphael Yuster,et al.  Finding heaviest H-subgraphs in real weighted graphs, with applications , 2006, TALG.

[42]  Mehmet Deveci,et al.  Sparse Matrix-Matrix Multiplication for Modern Architectures , 2016 .

[43]  Endong Wang,et al.  Intel Math Kernel Library , 2014 .

[44]  Rasmus Pagh,et al.  The Input/Output Complexity of Sparse Matrix Multiplication , 2014, ESA.

[45]  Timothy M. Chan More algorithms for all-pairs shortest paths in weighted graphs , 2007, STOC '07.

[46]  Pradeep Dubey,et al.  Sort vs. Hash Revisited: Fast Join Implementation on Modern Multi-Core CPUs , 2009, Proc. VLDB Endow..

[47]  Brian Vinter,et al.  An Efficient GPU General Sparse Matrix-Matrix Multiplication for Irregular Data , 2014, 2014 IEEE 28th International Parallel and Distributed Processing Symposium.

[48]  Edith Cohen On Optimizing Multiplications of Sparse Matrices , 1996, IPCO.

[49]  Timothy A. Davis,et al.  The university of Florida sparse matrix collection , 2011, TOMS.