Computing the sparse matrix vector product using block-based kernels without zero padding on processors with AVX-512 instructions

The sparse matrix-vector product (SpMV) is a fundamental operation in scientific applications across many fields. The High Performance Computing (HPC) community has therefore invested considerable and continuous effort in providing efficient SpMV kernels for modern CPU architectures. Block-based kernels have been shown to help achieve high performance, but they are difficult to use in practice because of the substantial zero padding they require. In this paper, we propose new kernels based on the AVX-512 instruction set, which makes it possible to use a blocking scheme without any zero padding in the matrix memory storage. We describe mask-based sparse matrix formats and their corresponding SpMV kernels, which are highly optimized in assembly language. Because the optimal block size depends on the matrix, we also provide a method that predicts the best kernel to use by means of a simple interpolation of the results from previous executions. We compare the performance of our approach against the Intel MKL CSR kernel and the CSR5 open-source package on a set of standard benchmark matrices. We show that we can achieve significant improvements in many cases, for both sequential and parallel execution. Finally, we provide the corresponding code in an open-source library called SPC5.
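
To make the idea of a mask-based block format without zero padding more concrete, the following is a minimal scalar sketch in C++. It is not the SPC5 implementation: the struct `MaskBlockMatrix`, the functions `fromCsr` and `spmv`, the fixed 1x8 block shape, and the choice to start each block at the column of its first nonzero are all assumptions made for illustration. Each block stores a starting column and an 8-bit mask, while the values array holds only the nonzeros.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical mask-based storage with 1x8 blocks (illustrative names, not the SPC5 API).
struct MaskBlockMatrix {
    std::size_t rows = 0, cols = 0;
    std::vector<std::size_t> rowBlockPtr;  // rows+1 entries: index of the first block of each row
    std::vector<std::size_t> blockCol;     // column of the first slot of each block
    std::vector<std::uint8_t> blockMask;   // bit k set => column blockCol+k holds a nonzero
    std::vector<double> values;            // nonzeros only, row by row, no padding
};

// Build the block format from CSR arrays (column indices sorted within each row).
MaskBlockMatrix fromCsr(std::size_t rows, std::size_t cols,
                        const std::vector<std::size_t>& rowPtr,
                        const std::vector<std::size_t>& colIdx,
                        const std::vector<double>& vals) {
    MaskBlockMatrix m;
    m.rows = rows;
    m.cols = cols;
    m.values = vals;  // the value array is reused as-is: nothing is padded
    m.rowBlockPtr.push_back(0);
    for (std::size_t r = 0; r < rows; ++r) {
        std::size_t i = rowPtr[r];
        while (i < rowPtr[r + 1]) {
            const std::size_t start = colIdx[i];  // assumption: block starts at its first nonzero
            unsigned mask = 0;
            while (i < rowPtr[r + 1] && colIdx[i] < start + 8) {
                mask |= 1u << (colIdx[i] - start);
                ++i;
            }
            m.blockCol.push_back(start);
            m.blockMask.push_back(static_cast<std::uint8_t>(mask));
        }
        m.rowBlockPtr.push_back(m.blockCol.size());
    }
    return m;
}

// Scalar SpMV y = A * x: the mask tells which of the 8 slots consume a stored value.
std::vector<double> spmv(const MaskBlockMatrix& m, const std::vector<double>& x) {
    assert(x.size() == m.cols);
    std::vector<double> y(m.rows, 0.0);
    std::size_t v = 0;  // running position in the packed value array
    for (std::size_t r = 0; r < m.rows; ++r) {
        double sum = 0.0;
        for (std::size_t b = m.rowBlockPtr[r]; b < m.rowBlockPtr[r + 1]; ++b) {
            for (unsigned k = 0; k < 8; ++k) {
                if ((m.blockMask[b] >> k) & 1u) {
                    sum += m.values[v++] * x[m.blockCol[b] + k];
                }
            }
        }
        y[r] = sum;
    }
    return y;
}
```

The inner loop over the mask bits is the part a vectorized kernel would replace. On AVX-512 hardware, a masked expand load such as the `_mm512_maskz_expandloadu_pd` intrinsic can place packed values into the vector lanes selected by the mask, which is presumably what allows the blocks to be stored without zeros; that reading is an assumption based on the abstract, not a description of the authors' assembly kernels.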
