Noise-adaptive front-end normalization based on Vector Taylor Series for Deep Neural Networks in robust speech recognition

Deep Neural Networks (DNNs) have been successfully applied to a variety of speech tasks in recent years. In this paper, we investigate the use of DNNs for noise-robust speech recognition and demonstrate that they model acoustic variation better than conventional Gaussian Mixture Models (GMMs). We then propose to compensate the normalization front-end of the DNN using the GMM-based Vector Taylor Series (VTS) model compensation technique, which has been applied successfully to handle noisy speech in GMM-based ASR systems. To benefit fully from both the powerful modeling capability of the DNN and the effective noise compensation of VTS, we further develop an adaptive training algorithm. Preliminary experiments on the AURORA 2 task demonstrate the effectiveness of our approach: the adaptively trained system outperforms GMM-based VTS adaptive training by 18.8% relative with MFCC features and by 21.9% relative with FBank features.
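As background for the compensation technique named in the abstract, the sketch below shows the standard log-spectral mismatch function on which VTS compensation is built, together with its zeroth-order Taylor term. This is the textbook formulation (e.g., Moreno et al.'s environment model), not code from this paper; the function names and the scalar, additive-noise-plus-channel setting are illustrative assumptions.

```python
import math

def vts_mismatch(x, n, h):
    """Standard VTS environment model in the log-spectral domain:
    noisy speech y = x + h + log(1 + exp(n - x - h)),
    where x is clean speech, n additive noise, h the channel (all in log-spectra).
    Scalar version for illustration; real systems apply this per filterbank bin."""
    return x + h + math.log1p(math.exp(n - x - h))

def vts_compensated_mean(mu_x, mu_n, mu_h):
    """Zeroth-order term of the Taylor expansion: the compensated mean is the
    mismatch function evaluated at the expansion point (mu_x, mu_n, mu_h).
    Higher-order VTS adds Jacobian-weighted covariance terms, omitted here."""
    return vts_mismatch(mu_x, mu_n, mu_h)
```

The two limiting cases give a quick sanity check: when speech dominates the noise, the compensated mean reduces to the clean mean plus the channel offset; when noise dominates, it collapses to the noise mean.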
