Robust speech recognition with speech enhanced deep neural networks

We propose a signal pre-processing front-end that enhances speech using deep neural networks (DNNs) and uses the enhanced speech features directly to train hidden Markov models (HMMs) for robust speech recognition. As a comprehensive study, we examine its effectiveness for different acoustic features, acoustic models, and training-testing combinations. Tested on the Aurora4 task, the experimental results indicate that our proposed framework consistently outperforms state-of-the-art speech recognition systems in all evaluation conditions. To the best of our knowledge, this is the first result on the Aurora4 task to yield performance gains using only an enhancement pre-processor, without any adaptation or compensation post-processing, on top of the best DNN-HMM system. The word error rate reduction from the baseline system is up to 50% for clean-condition training and 15% for multi-condition training. We believe the system performance could be improved further by incorporating post-processing techniques that work coherently with the proposed enhancement pre-processing scheme.
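To make the front-end concrete, below is a minimal sketch (not the authors' implementation) of a DNN enhancement pre-processor of the kind described: a feed-forward network that maps a context window of noisy log-power spectral frames to the corresponding clean frame and is trained with a mean-squared-error regression loss. The frame dimension, context size, layer widths, activations, and optimizer settings are illustrative assumptions, and PyTorch is used only for convenience.

```python
# Hedged sketch of a DNN speech-enhancement front-end (assumptions marked).
import torch
import torch.nn as nn

FRAME_DIM = 257   # log-power spectrum bins per frame (assumed)
CONTEXT = 11      # number of stacked noisy frames fed to the network (assumed)
HIDDEN = 2048     # hidden-layer width (assumed)

class EnhancementDNN(nn.Module):
    """Regresses a clean log-spectral frame from a window of noisy frames."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(FRAME_DIM * CONTEXT, HIDDEN), nn.Sigmoid(),
            nn.Linear(HIDDEN, HIDDEN), nn.Sigmoid(),
            nn.Linear(HIDDEN, HIDDEN), nn.Sigmoid(),
            nn.Linear(HIDDEN, FRAME_DIM),  # linear output: enhanced frame
        )

    def forward(self, noisy_context):
        return self.net(noisy_context)

def train_step(model, optimizer, noisy_context, clean_frame):
    """One MSE regression step on a batch of (noisy context, clean frame) pairs."""
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(noisy_context), clean_frame)
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage sketch with dummy data: the enhanced frames produced by this
# front-end would then be converted to acoustic features and used to
# train the HMM back-end, as described in the abstract.
model = EnhancementDNN()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
noisy = torch.randn(32, FRAME_DIM * CONTEXT)  # dummy batch of stacked noisy frames
clean = torch.randn(32, FRAME_DIM)            # dummy clean-frame targets
print(train_step(model, optimizer, noisy, clean))
```

The key design point illustrated here is that enhancement is a separate regression stage applied before acoustic modeling, so the recognizer itself needs no adaptation or compensation post-processing.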
