Combining non-negative matrix factorization and deep neural networks for speech enhancement and automatic speech recognition

Sparse Non-negative Matrix Factorization (SNMF) and Deep Neural Networks (DNN) have emerged individually as two effective machine learning techniques for single-channel speech enhancement. Nevertheless, only a few works have investigated combining SNMF and DNNs for speech enhancement and robust Automatic Speech Recognition (ASR). In this paper, we present a novel combination of SNMF- and DNN-based speech enhancement components into a full-stack system. We refine the cost function of the DNN so that the reconstruction error of the enhanced speech is back-propagated. Our proposal is compared with several state-of-the-art speech enhancement systems. Evaluations are conducted on the CHiME-3 challenge data, which consist of real noisy speech recordings captured under challenging noise conditions. Our system yields significant improvements in objective speech enhancement quality measures, with a relative gain of 30%, and a 10% relative Word Error Rate reduction for ASR compared to the best baselines.
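To make the refined cost function concrete, the sketch below shows one common way a DNN can be trained so that the reconstruction error of the enhanced speech, rather than of an intermediate target, is back-propagated: the network predicts a soft time-frequency mask, the enhanced magnitude spectrogram is formed by masking the noisy input, and the loss is computed on that enhanced output. This is a minimal illustration under assumed choices (mask-based enhancement, an MSE loss, and arbitrary layer sizes); the actual architecture and SNMF coupling used in the paper may differ.

```python
# Hypothetical sketch (not the authors' exact model): a mask-estimating DNN
# trained so that the loss is the reconstruction error of the *enhanced*
# magnitude spectrogram, so gradients flow through the enhancement step.
import torch
import torch.nn as nn

class MaskDNN(nn.Module):
    def __init__(self, n_freq=257, n_hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_freq, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_freq), nn.Sigmoid(),  # soft mask in [0, 1]
        )

    def forward(self, noisy_mag):
        return self.net(noisy_mag)

def enhancement_loss(model, noisy_mag, clean_mag):
    """Reconstruction error of the enhanced speech (MSE assumed here)."""
    mask = model(noisy_mag)
    enhanced_mag = mask * noisy_mag  # enhancement by spectral masking
    return nn.functional.mse_loss(enhanced_mag, clean_mag)

if __name__ == "__main__":
    model = MaskDNN()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    # Dummy batch of magnitude-spectrogram frames (batch, n_freq).
    noisy = torch.rand(32, 257)
    clean = torch.rand(32, 257)
    loss = enhancement_loss(model, noisy, clean)
    loss.backward()   # gradients propagate through the masking / enhancement step
    optimizer.step()
```

In a combined SNMF+DNN pipeline, the masked (or SNMF-reconstructed) spectrogram would play the role of `enhanced_mag` above, so the same principle of back-propagating the enhanced-speech error applies.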
