MMSE-Based Missing-Feature Reconstruction With Temporal Modeling for Robust Speech Recognition

This paper addresses feature compensation in the log-spectral domain using the missing-data (MD) approach to noise-robust speech recognition, which assumes that each log-spectral feature is either almost unaffected by noise or completely masked by it. First, a general MD framework based on minimum mean square error (MMSE) estimation is introduced, which exploits the correlation across frequency bands to reconstruct the missing features. This framework allows the derivation of different MD imputation approaches; in particular, a novel technique that takes advantage of truncated Gaussian distributions is presented. While the proposed technique provides excellent results at high and medium signal-to-noise ratios (SNRs), its performance degrades at low SNRs, where very few reliable features are available. The reconstruction technique is therefore extended to exploit temporal constraints using two different approaches. In the first, time-frequency patches of speech spanning several consecutive frames are modeled using a Gaussian mixture model (GMM). In the second, the sequential structure of speech is instead modeled by a hidden Markov model (HMM). The proposed techniques are evaluated on the Aurora-2 and Aurora-4 databases using both oracle and estimated masks. In both cases, the proposed techniques outperform the baseline system and other related techniques. The introduction of temporal modeling also proves very effective in reconstructing spectra at low SNRs. In particular, HMMs show the highest capability of accounting for time correlations and therefore achieve the best results.
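
To make the idea of exploiting correlation across frequency bands concrete, the following is a minimal sketch of plain GMM-based conditional-mean (MMSE) imputation for a single frame. It is not the paper's implementation: the function name `gmm_mmse_impute` and all parameter names are hypothetical, the GMM is assumed to be already trained on clean log-spectra with full covariances, and the sketch deliberately omits the truncated-Gaussian refinement and the temporal (patch or HMM) modeling mentioned in the abstract.

```python
import numpy as np

def gmm_mmse_impute(y, reliable, weights, means, covs):
    """Conditional-mean (MMSE) estimate of masked log-spectral bins.

    Hypothetical helper, for illustration only.
    y        : (D,) observed noisy log-spectrum for one frame
    reliable : (D,) boolean mask, True where the bin is treated as clean
    weights  : (K,) GMM mixture weights
    means    : (K, D) component means
    covs     : (K, D, D) full component covariances
    """
    r = np.asarray(reliable, dtype=bool)
    m = ~r
    x_hat = np.array(y, dtype=float)
    if not m.any():
        return x_hat                      # nothing to reconstruct
    if not r.any():
        x_hat[:] = weights @ means        # no evidence: use the prior mean
        return x_hat

    y_r = x_hat[r]
    log_post = np.empty(len(weights))
    cond_means = np.empty((len(weights), int(m.sum())))
    for k, (w, mu, S) in enumerate(zip(weights, means, covs)):
        S_rr = S[np.ix_(r, r)]            # reliable-reliable covariance
        S_mr = S[np.ix_(m, r)]            # missing-reliable cross-covariance
        diff = y_r - mu[r]
        sol = np.linalg.solve(S_rr, diff)
        _, logdet = np.linalg.slogdet(S_rr)
        # Log posterior of component k given the reliable bins,
        # up to a constant shared by all components.
        log_post[k] = np.log(w) - 0.5 * (diff @ sol + logdet)
        # Gaussian conditional mean of the missing bins under component k.
        cond_means[k] = mu[m] + S_mr @ sol

    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    x_hat[m] = post @ cond_means          # posterior-weighted average
    return x_hat
```

Reliable bins are passed through unchanged, and each masked bin is replaced by the posterior-weighted average of the per-component conditional means. The value observed in a masked bin is ignored here; a truncated-Gaussian approach could additionally use it to constrain the estimate.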
