Deep Neural Networks (DNNs) are widely used in applications such as speech recognition and computer vision. The core computational kernel of DNN-based applications is large sparse-dense matrix multiplication. Because the performance of existing methods and software libraries for sparse matrix multiplication falls short of expectations, real-time recognition has not yet been achieved. We therefore propose a novel sparse matrix storage format that blocks the classic CSR (compressed sparse row) and COO (coordinate) formats, called BCSR&BCOO, together with a thread-scalable computing kernel for sparse-dense matrix multiplication, called BSpMM. We evaluate the proposed data structure and computing kernel in a real DNN-based online speech recognition application. The experimental results demonstrate up to a 4x speedup over Intel MKL on a typical CPU-based multicore system, along with a significant improvement in achieved FLOPS.
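The abstract does not spell out the BCSR layout or the BSpMM kernel, so the following C sketch shows one common block-CSR encoding and a straightforward sparse-dense multiply over it; the type and field names (bcsr_t, bcsr_spmm, row_ptr, col_idx) are illustrative assumptions, not the paper's API.

```c
#include <stddef.h>

/* Hypothetical block-CSR layout: the sparse matrix is partitioned into
 * dense r x c blocks; only nonzero blocks are stored, indexed CSR-style
 * by block row. This is a generic BCSR sketch, not the paper's format. */
typedef struct {
    int    block_rows;   /* number of block rows (matrix rows / r)       */
    int    r, c;         /* block dimensions                             */
    int   *row_ptr;      /* size block_rows+1: block range per block row */
    int   *col_idx;      /* block-column index of each stored block      */
    float *values;       /* r*c entries per block, row-major             */
} bcsr_t;

/* Sparse(BCSR) x dense multiply, Y += A * X, where X and Y are dense
 * row-major matrices with n columns. Block rows touch disjoint rows of
 * Y, so the outer loop parallelizes without synchronization. */
void bcsr_spmm(const bcsr_t *A, const float *X, float *Y, int n)
{
    #pragma omp parallel for
    for (int bi = 0; bi < A->block_rows; ++bi) {
        for (int k = A->row_ptr[bi]; k < A->row_ptr[bi + 1]; ++k) {
            const float *blk = A->values + (size_t)k * A->r * A->c;
            int row0 = bi * A->r;             /* first row of this block    */
            int col0 = A->col_idx[k] * A->c;  /* first column of this block */
            for (int i = 0; i < A->r; ++i)
                for (int j = 0; j < A->c; ++j) {
                    float a = blk[i * A->c + j];
                    for (int t = 0; t < n; ++t)
                        Y[(size_t)(row0 + i) * n + t] += a * X[(size_t)(col0 + j) * n + t];
                }
        }
    }
}
```

Distributing work over independent block rows, as above, is the usual source of thread scalability in kernels of this shape; the dense inner block loops also expose contiguous accesses that a vectorizing compiler can exploit.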