A Novel Mask Estimation Method Employing Posterior-Based Representative Mean Estimate for Missing-Feature Speech Recognition

This paper proposes a novel mask estimation method for missing-feature reconstruction to improve speech recognition performance under various types of background noise. The conventional mask estimation method based on spectral subtraction degrades performance because its noise estimate fails to track the variations of the background noise over the course of the incoming speech utterance. The proposed method instead determines the reliability of the input speech spectral components using a Posterior-based Representative Mean (PRM) estimate, which is obtained as a weighted sum of the mean parameters of the speech model, with the weights given by the posterior probabilities. To obtain the noise-corrupted speech model, we employ a model combination method proposed in our previous study for feature compensation. Experimental results demonstrate that the proposed mask estimation method provides more separable distributions for the reliable/unreliable component classifier than the conventional method. Recognition performance is evaluated on the Aurora 2.0 framework over various types of background noise and on the CU-Move real-life in-vehicle corpus. The evaluation shows that the proposed mask estimation method is considerably more effective at improving speech recognition performance in these noise conditions than the conventional spectral-subtraction-based method. By employing the proposed PRM-based mask estimation for missing-feature reconstruction, we obtain average relative improvements in word error rate of +23.41% across all four noise types and +9.45% on the CU-Move corpus, respectively, compared to conventional mask estimation methods.
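To make the core idea concrete, the sketch below illustrates, under stated assumptions, how a posterior-weighted mean of Gaussian mixture components and a threshold-based reliability decision might be computed. It is not the authors' exact formulation: the function names, the diagonal-covariance GMM, and the simple threshold rule in `estimate_mask` are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def prm_estimate(x, means, covars, weights):
    """Posterior-based Representative Mean (PRM) sketch.

    x       : observed log-spectral feature vector, shape (D,)
    means   : mean vectors of the (noise-corrupted) speech GMM, shape (K, D)
    covars  : diagonal covariances (variances) of the GMM, shape (K, D)
    weights : mixture weights, shape (K,)
    Returns the posterior-weighted sum of the mixture means, shape (D,).
    """
    # Log joint probability of x and each diagonal-covariance component.
    log_probs = -0.5 * np.sum(
        np.log(2.0 * np.pi * covars) + (x - means) ** 2 / covars, axis=1
    ) + np.log(weights)
    # Posterior probability of each component given x (softmax of log probs).
    posteriors = np.exp(log_probs - np.max(log_probs))
    posteriors /= posteriors.sum()
    # PRM: weighted sum of the component means using the posteriors.
    return posteriors @ means

def estimate_mask(x, prm, threshold=3.0):
    """Label each spectral component as reliable (1) or unreliable (0).

    Here a component is flagged unreliable when the observation exceeds the
    PRM estimate by more than a fixed threshold; the actual decision rule and
    threshold used in the paper may differ (this is an assumed stand-in).
    """
    return (x - prm <= threshold).astype(int)
```

In a missing-feature system, the resulting binary mask would then drive reconstruction of the unreliable components before recognition; the threshold here is only a placeholder for whatever classifier the reliable/unreliable decision actually uses.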
