Efficient MMSE Estimation and Uncertainty Processing for Multienvironment Robust Speech Recognition

This paper presents a feature compensation framework based on minimum mean square error (MMSE) estimation and stereo training data for robust speech recognition. In our proposal, we model the clean and noisy feature spaces in order to obtain clean feature estimates. However, unlike other well-known MMSE compensation methods such as SPLICE or MEMLIN, which model those spaces with Gaussian mixture models (GMMs), in our case every feature space is characterized by a set of prototype vectors, which can alternatively be viewed as a vector quantization (VQ) codebook. The discrete nature of this feature space characterization introduces two significant advantages. First, it allows the implementation of an MMSE estimator that is very efficient in terms of both accuracy and computational cost. Second, time correlations can be exploited by means of hidden Markov modeling (HMM). In addition, a novel subregion-based modeling is applied in order to accurately represent the transformation between the clean and noisy domains. In order to deal with unknown environments, a multiple-model approach is also explored. Since this approach has been shown to be quite sensitive to incorrect environment classification, we adapt two uncertainty processing techniques, soft-data decoding and exponential weighting, to our estimation framework. As a result, environment misclassifications are concealed, allowing better performance under unknown environments. The experimental results on noisy digit recognition show a relative improvement of 87.93% in word accuracy over the baseline when clean acoustic models are used, while a relative improvement of 4.54% is achieved with multi-style trained models.
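
The idea of a codebook-based MMSE estimator can be illustrated with a minimal sketch. It is not the estimator defined in the paper: it assumes hard quantization of each noisy vector to its nearest prototype, a conditional probability table P(clean cell | noisy cell) obtained by counting stereo (clean, noisy) pairs, and a clean estimate formed as the probability-weighted mean of the clean prototypes. The subregion-based modeling, the HMM-based use of time correlations, and the uncertainty processing mentioned above are left out. All function names, codebooks, and data values are hypothetical.

```python
def nearest(codebook, v):
    """Index of the prototype closest to v (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((a - b) ** 2 for a, b in zip(codebook[k], v)),
    )


def train_cond_probs(clean_cb, noisy_cb, stereo_pairs):
    """Estimate P(clean cell i | noisy cell j) by counting stereo pairs.

    stereo_pairs is a list of (clean_vector, noisy_vector) tuples.
    Noisy cells that were never observed fall back to a uniform
    distribution (an assumption made for this sketch only).
    """
    counts = [[0] * len(clean_cb) for _ in noisy_cb]
    for x, y in stereo_pairs:
        counts[nearest(noisy_cb, y)][nearest(clean_cb, x)] += 1
    probs = []
    for row in counts:
        total = sum(row)
        probs.append([c / total if total else 1.0 / len(row) for c in row])
    return probs


def mmse_estimate(y, clean_cb, noisy_cb, cond_probs):
    """Clean estimate: sum_i P(i | j) * clean_prototype_i, where j is
    the noisy cell that y is quantized to."""
    j = nearest(noisy_cb, y)
    dim = len(clean_cb[0])
    return [
        sum(cond_probs[j][i] * clean_cb[i][d] for i in range(len(clean_cb)))
        for d in range(dim)
    ]


# Toy one-dimensional example with two prototypes per space.
clean_cb = [[0.0], [10.0]]
noisy_cb = [[1.0], [9.0]]
stereo_pairs = [
    ([0.0], [1.2]),
    ([0.5], [0.8]),
    ([10.0], [9.1]),
    ([0.2], [8.7]),  # a clean low-valued frame that noise pushed high
]
cond_probs = train_cond_probs(clean_cb, noisy_cb, stereo_pairs)
estimate_low = mmse_estimate([1.1], clean_cb, noisy_cb, cond_probs)
estimate_high = mmse_estimate([9.0], clean_cb, noisy_cb, cond_probs)
```

Because the noisy vector is mapped to a single cell, the weighted sum for each noisy cell could be precomputed once after training, leaving a nearest-prototype search and a table lookup at run time. This is one plausible reading of the efficiency argument, not a statement of how the paper implements it.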
