Spectral Feature Mapping with Mimic Loss for Robust Speech Recognition

For the task of speech enhancement, local learning objectives are agnostic to the phonetic structure that aids speech recognition. We propose adding a global criterion to ensure that de-noised speech remains useful for downstream tasks such as ASR. We first train a spectral classifier on clean speech to predict senone labels. The spectral classifier is then joined with our speech enhancer to form a noisy-speech recognizer, which is taught to imitate the output of the spectral classifier alone on clean speech. This mimic loss is combined with the traditional local criterion to train the speech enhancer to produce de-noised speech. Feeding the de-noised speech to an off-the-shelf Kaldi training recipe for the CHiME-2 corpus yields significant improvements in word error rate (WER).

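To make the training objective concrete, here is a minimal sketch of how the combined local and mimic criteria could be computed. It assumes PyTorch; the module names `enhancer` and `senone_classifier` and the weights `alpha` and `beta` are illustrative assumptions, not the paper's actual code or settings.

```python
import torch
import torch.nn as nn

def combined_loss(enhancer, senone_classifier, noisy_spec, clean_spec,
                  alpha=1.0, beta=1.0):
    """Sketch of the local + mimic training criterion.

    enhancer:          network mapping noisy spectra -> de-noised spectra
    senone_classifier: network pre-trained on clean speech to predict senones,
                       kept frozen here (only the enhancer is optimized)
    alpha, beta:       illustrative weights on the two criteria
    """
    mse = nn.MSELoss()

    denoised = enhancer(noisy_spec)

    # Local criterion: the de-noised spectrum should match the parallel
    # clean spectrum frame by frame.
    local_loss = mse(denoised, clean_spec)

    # Global (mimic) criterion: the frozen classifier's outputs on de-noised
    # speech should imitate its outputs on the clean speech. The clean-side
    # forward pass is a fixed target, so no gradient is needed there.
    with torch.no_grad():
        clean_posteriors = senone_classifier(clean_spec)
    denoised_posteriors = senone_classifier(denoised)
    mimic_loss = mse(denoised_posteriors, clean_posteriors)

    return alpha * local_loss + beta * mimic_loss
```

In this setup only the enhancer's parameters would be passed to the optimizer: gradients flow through the frozen classifier into the enhancer, so the classifier acts purely as a teacher whose clean-speech senone outputs the enhanced speech must reproduce.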