CSR2: A New Format for SIMD-accelerated SpMV

Sparse matrix-vector multiplication (SpMV) is a core kernel in scientific and engineering applications, and improving its performance has long been an active research topic. In this paper, we propose CSR2 (Compressed Sparse Row 2), a new sparse matrix storage format suited to SIMD (Single Instruction Multiple Data)-accelerated SpMV. First, conversion to CSR2 is easy to implement and has low overhead. Second, CSR2 is a single, self-contained format designed for processor platforms with SIMD vectorization. We compare the SpMV algorithm based on CSR2 with one based on CSR5 (Compressed Sparse Row 5), the current state-of-the-art single format, on two mainstream high-performance processors: the Intel Core i7-7700HQ CPU and the Intel Xeon CPU E5-2670 v3. Our benchmark suite consists of 10 regular matrices and 3 irregular matrices. Experiments show that across these 13 regular and irregular matrices, CSR2 delivers an average performance improvement of more than 50% over CSR5 (up to 125% on the Intel Core i7-7700HQ CPU and 303% on the Intel Xeon CPU E5-2670 v3). For real-world applications that perform many SpMV iterations, CSR2 thus offers both low-overhead format conversion and high-throughput computation.
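For orientation, the sketch below shows the standard baseline CSR SpMV kernel that SIMD-oriented formats such as CSR5 and CSR2 start from and reorganize; it is not the paper's CSR2 layout (CSR2's specific blocking and metadata are defined in the paper body), and the 4x4 example matrix and array names are illustrative only.

```c
/* Minimal sketch of baseline CSR SpMV (y = A*x), for orientation only.
 * This is the textbook CSR kernel, NOT the paper's CSR2 layout. */
#include <stdio.h>

/* CSR arrays for a 4x4 example matrix with 6 nonzeros:
 *   | 1 0 2 0 |
 *   | 0 3 0 0 |
 *   | 0 0 4 5 |
 *   | 6 0 0 0 |
 */
static const int    row_ptr[5] = {0, 2, 3, 5, 6};    /* row i spans [row_ptr[i], row_ptr[i+1]) */
static const int    col_idx[6] = {0, 2, 1, 2, 3, 0}; /* column index of each nonzero */
static const double vals[6]    = {1, 2, 3, 4, 5, 6}; /* nonzero values */

/* y = A * x for an n-row CSR matrix. The inner loop over each row's
 * nonzeros is the part that SIMD-oriented formats restructure so that
 * vector lanes stay full even when row lengths are irregular. */
static void spmv_csr(int n, const int *rp, const int *ci,
                     const double *v, const double *x, double *y)
{
    for (int i = 0; i < n; ++i) {
        double sum = 0.0;
        for (int k = rp[i]; k < rp[i + 1]; ++k)
            sum += v[k] * x[ci[k]];
        y[i] = sum;
    }
}

int main(void)
{
    const double x[4] = {1, 1, 1, 1};
    double y[4];
    spmv_csr(4, row_ptr, col_idx, vals, x, y);
    for (int i = 0; i < 4; ++i)
        printf("y[%d] = %g\n", i, y[i]);  /* expect 3, 3, 9, 6 */
    return 0;
}
```

The irregularity visible here (rows of differing nonzero counts) is what makes naive vectorization of the inner loop inefficient, and it is the problem that both CSR5 and the proposed CSR2 address.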
