An investigation of spectral restoration algorithms for deep neural networks based noise robust speech recognition

Deep neural networks (DNNs) are becoming widely accepted in automatic speech recognition (ASR) systems. Their deep, structured nonlinear processing greatly improves the model’s generalization capability, but performance in adverse acoustic environments is still unsatisfactory. Many techniques have been developed in the literature to improve the noise robustness of Gaussian mixture models. Investigating how well these techniques work for DNNs is an important step toward understanding the DNN's advantages, pinpointing its limitations and, most importantly, improving it further toward human-level robustness. In this paper, we investigate the effectiveness of speech enhancement with spectral restoration algorithms for DNNs. Four approaches are evaluated: the minimum mean-square error spectral estimator (MMSE), the maximum likelihood spectral amplitude estimator (MLSA), the maximum a posteriori spectral amplitude estimator (MAPA), and the generalized maximum a posteriori spectral amplitude algorithm (GMAPA). Preliminary experimental results on the Aurora 2 speech database show that, with multi-condition training data, the DNN itself is capable of learning robust representations. However, if only clean training data is available, the MLSA algorithm performs best among the spectral restoration methods evaluated for DNNs.
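To make the idea of spectral restoration concrete, the sketch below applies a per-bin gain to a noisy magnitude spectrum, using one commonly cited form of the maximum likelihood amplitude estimate, A_hat = 0.5 * (|Y| + sqrt(|Y|^2 - noise_power)). This is an illustration under stated assumptions, not necessarily the formulation used in the paper: the function names, the flooring of the a posteriori SNR at 1, and the assumption that a per-bin noise power estimate is already available are all hypothetical choices made here.

```python
import math

def ml_amplitude_gain(noisy_power, noise_power):
    """Gain for one frequency bin under an assumed ML amplitude estimator.

    noisy_power: |Y|^2 of the noisy speech in this bin.
    noise_power: estimated noise power in this bin (assumed given, > 0).
    """
    # A posteriori SNR, floored at 1 so the square root stays real
    # (the floor is an assumption of this sketch).
    gamma = max(noisy_power / noise_power, 1.0)
    # Equivalent to A_hat / |Y| with A_hat = 0.5 * (|Y| + sqrt(|Y|^2 - noise_power)).
    return 0.5 + 0.5 * math.sqrt((gamma - 1.0) / gamma)

def restore_magnitudes(noisy_magnitudes, noise_powers):
    """Apply the gain bin by bin to a noisy magnitude spectrum."""
    return [
        ml_amplitude_gain(mag * mag, noise) * mag
        for mag, noise in zip(noisy_magnitudes, noise_powers)
    ]

if __name__ == "__main__":
    # Three bins with unit noise power and increasing speech energy.
    print(restore_magnitudes([1.0, 2.0, 10.0], [1.0, 1.0, 1.0]))
```

In this form the gain stays between 0.5 and 1: bins dominated by noise are attenuated by at most half, while high-SNR bins pass through nearly unchanged. The restored magnitudes would then be recombined with the noisy phase or passed on to feature extraction for the recognizer.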
