An efficient sparse-dense matrix multiplication on a multicore system

Deep Neural Networks (DNNs) are widely used in applications such as speech recognition and computer vision. The computational kernel of DNN-based applications is large sparse-dense matrix multiplication. Because the performance of existing methods and software libraries for sparse matrix multiplication falls short of expectations, real-time recognition has not yet been achieved. We therefore propose a novel sparse matrix storage format combining block-based CSR (compressed sparse row) and COO (coordinate) formats, called BCSR&BCOO, together with a thread-scalable computing kernel for sparse-dense matrix multiplication, called BSpMM. We evaluate the proposed data structure and kernel in a real application, DNN-based online speech recognition. The experimental results demonstrate up to a 4x speedup over Intel MKL on a typical CPU-based multicore system, along with a significant improvement in achieved FLOPS.
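The abstract does not specify the internals of BCSR&BCOO or BSpMM, but the core idea of a block-based CSR sparse-dense product can be sketched as follows. This is a minimal illustrative implementation, not the paper's kernel: the function name `bcsr_spmm`, the square block size `bs`, and the three-array layout (`block_vals`, `col_idx`, `row_ptr`) are assumptions chosen for clarity.

```python
import numpy as np

def bcsr_spmm(block_vals, col_idx, row_ptr, B, bs):
    """Multiply a sparse matrix A, stored in block CSR form, by a dense matrix B.

    block_vals: list of dense (bs x bs) nonzero blocks, in block-row order
    col_idx:    block-column index of each stored block
    row_ptr:    CSR-style offsets into block_vals, one per block row (+1)
    B:          dense right-hand-side matrix
    bs:         block size (assumed square here for simplicity)
    """
    n_block_rows = len(row_ptr) - 1
    C = np.zeros((n_block_rows * bs, B.shape[1]))
    for i in range(n_block_rows):
        # Each stored block (i, j) contributes a small dense GEMM;
        # blocking gives contiguous accesses to both A's values and B's rows.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[k]
            C[i * bs:(i + 1) * bs, :] += block_vals[k] @ B[j * bs:(j + 1) * bs, :]
    return C
```

In a multithreaded kernel like the one the paper describes, the outer loop over block rows is the natural unit of parallel work, since each block row writes a disjoint slice of the output.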
