An efficient sparse-dense matrix multiplication on a multicore system

Deep Neural Networks (DNNs) are widely used in applications such as speech recognition and computer vision. The computational kernel of DNN-based applications is large sparse-dense matrix multiplication. Because the performance of existing methods and software libraries for sparse matrix multiplication falls short of expectations, real-time recognition has not yet been achieved. We therefore propose a novel sparse matrix storage format combining block-based CSR (compressed sparse row) and COO (coordinate) formats, called BCSR&BCOO, together with a thread-scalable computing kernel for sparse-dense matrix multiplication, called BSpMM. We evaluate the proposed data structure and kernel in a real application, DNN-based online speech recognition. The experimental results demonstrate up to a 4x speedup over Intel MKL on a typical CPU-based multicore system, along with a significant improvement in achieved FLOPS.
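The abstract does not specify the internals of BCSR&BCOO or BSpMM, but the core idea of a block-based CSR sparse-dense product can be sketched as follows. This is a minimal illustrative implementation, not the paper's kernel: the function name `bcsr_spmm`, the square block size `bs`, and the three-array layout (`block_vals`, `col_idx`, `row_ptr`) are assumptions chosen for clarity.

```python
import numpy as np

def bcsr_spmm(block_vals, col_idx, row_ptr, B, bs):
    """Multiply a sparse matrix A, stored in block CSR form, by a dense matrix B.

    block_vals: list of dense (bs x bs) nonzero blocks, in block-row order
    col_idx:    block-column index of each stored block
    row_ptr:    CSR-style offsets into block_vals, one per block row (+1)
    B:          dense right-hand-side matrix
    bs:         block size (assumed square here for simplicity)
    """
    n_block_rows = len(row_ptr) - 1
    C = np.zeros((n_block_rows * bs, B.shape[1]))
    for i in range(n_block_rows):
        # Each stored block (i, j) contributes a small dense GEMM;
        # blocking gives contiguous accesses to both A's values and B's rows.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[k]
            C[i * bs:(i + 1) * bs, :] += block_vals[k] @ B[j * bs:(j + 1) * bs, :]
    return C
```

In a multithreaded kernel like the one the paper describes, the outer loop over block rows is the natural unit of parallel work, since each block row writes a disjoint slice of the output.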
