Performance Analysis of Sparse Matrix-Vector Multiplication (SpMV) on Graphics Processing Units (GPUs)

Graphics processing units (GPUs) have delivered remarkable performance for a variety of high-performance computing (HPC) applications through massive parallelism. One such application is sparse matrix-vector multiplication (SpMV), which is central to many scientific, engineering, and other applications, including machine learning. No single SpMV storage or computation scheme provides consistent and sufficiently high performance for all matrices, because their sparsity patterns vary. An extensive literature review reveals that the performance of SpMV techniques on GPUs has not been studied in sufficient detail. In this paper, we provide a detailed analysis of SpMV performance on GPUs using four notable sparse matrix storage schemes (compressed sparse row (CSR), ELLPACK (ELL), hybrid ELL/COO (HYB), and compressed sparse row 5 (CSR5)), five performance metrics (execution time, giga floating point operations per second (GFLOPS), achieved occupancy, instructions per warp, and warp execution efficiency), five matrix sparsity features (nnz, anpr, nprvariance, maxnpr, and distavg), and 17 sparse matrices from 10 application domains (chemical simulations, computational fluid dynamics (CFD), electromagnetics, linear programming, economics, etc.). Based on the insights gained through this analysis, we propose a technique called the heterogeneous CPU–GPU Hybrid (HCGHYB) scheme. It uses the CPU and GPU in parallel and outperforms the HYB format with an average speedup of 1.7x. Heterogeneous computing is an important direction for SpMV and other application areas. Moreover, to the best of our knowledge, this is the first work in which SpMV performance on GPUs has been discussed in such depth. We believe that this performance analysis and the heterogeneous scheme will open up new directions and improvements for the SpMV computing field.
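
To make the terminology above concrete, the following is a minimal sequential sketch, not the GPU kernels evaluated in the paper. It shows the CSR layout, a reference SpMV over it, and four of the five sparsity features. The function names are hypothetical. The readings of anpr (average nonzeros per row), maxnpr (maximum nonzeros per row), and nprvariance (taken here as the population variance of nonzeros per row) are assumptions inferred from the abbreviations. distavg is omitted because its definition is not given in this section. The GFLOPS helper assumes the common convention of two floating point operations (one multiply, one add) per stored nonzero.

```python
from statistics import mean, pvariance


def dense_to_csr(dense):
    """Convert a dense row-major matrix (list of lists) into CSR arrays."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        # row_ptr[i + 1] marks where row i ends in values/col_idx
        row_ptr.append(len(values))
    return values, col_idx, row_ptr


def csr_spmv(values, col_idx, row_ptr, x):
    """Reference (sequential) y = A * x for a matrix A stored in CSR."""
    y = []
    for i in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


def sparsity_features(row_ptr):
    """Row-based sparsity features (assumed definitions, see lead-in)."""
    npr = [row_ptr[i + 1] - row_ptr[i] for i in range(len(row_ptr) - 1)]
    return {
        "nnz": row_ptr[-1],            # total number of stored nonzeros
        "anpr": mean(npr),             # average nonzeros per row
        "nprvariance": pvariance(npr), # population variance (assumption)
        "maxnpr": max(npr),            # maximum nonzeros in any row
    }


def gflops(nnz, seconds):
    """Throughput assuming 2 floating point operations per nonzero."""
    return 2.0 * nnz / seconds / 1e9


if __name__ == "__main__":
    A = [[1, 0, 2],
         [0, 0, 3],
         [4, 5, 6]]
    values, col_idx, row_ptr = dense_to_csr(A)
    print(csr_spmv(values, col_idx, row_ptr, [1, 1, 1]))  # [3.0, 3.0, 15.0]
    print(sparsity_features(row_ptr))
```

A large maxnpr relative to anpr, or a large nprvariance, indicates uneven row lengths. In that case ELL must pad every row to the longest one, which is the situation the HYB format addresses by moving the overflow entries to COO.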
