Automatic Speech Recognition Using Missing Data Techniques: Handling of Real-World Data

In this chapter, we investigate the performance of a missing data recognizer on real-world speech from the SPEECON and SpeechDat-Car databases. In previous work, we hypothesized that in real-world speech, which is corrupted not only by environmental noise but also by speaker, reverberation and channel effects, the features labeled ‘reliable’ do not match an acoustic model trained on clean speech. In a series of experiments, we test this hypothesis and explore to what extent performance can be improved by combining missing data techniques (MDT) with three conventional techniques, viz. multi-condition training, dereverberation and feature enhancement. Our results confirm the hypothesis. They show that the mismatch can be reduced both by multi-condition training of the acoustic models and by feature enhancement, and that the two effects combine to some degree. Our experiments with dereverberation reveal that reverberation can have a major impact on recognition performance, but that MDT with a suitable missing data mask can compensate for environmental noise and reverberation simultaneously.
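
The abstract relies on two notions: a missing data mask, which labels each spectral feature as reliable or unreliable, and the way an MDT recognizer scores an acoustic model state from that labeling. The sketch below illustrates both in generic form. It assumes log-spectral features in dB, diagonal-Gaussian states, a simple local-SNR threshold mask and bounded marginalization as the scoring rule. The function names, the 0 dB threshold and the -100 dB floor are hypothetical choices for illustration. This is not the recognizer, feature domain or mask estimator used in these experiments.

```python
import math
from typing import List, Sequence

# Illustrative assumptions (not taken from this chapter):
#   * features are log-spectral energies in dB, one value per frequency channel;
#   * each acoustic-model state is a diagonal Gaussian (mean, variance per channel);
#   * a per-channel noise estimate in dB is available from some noise tracker.


def snr_mask(noisy_db: Sequence[float], noise_db: Sequence[float],
             threshold_db: float = 0.0) -> List[bool]:
    """Hypothetical SNR-threshold mask: True marks a 'reliable' channel."""
    mask = []
    for y_db, n_db in zip(noisy_db, noise_db):
        y_pow = 10.0 ** (y_db / 10.0)        # noisy power
        n_pow = 10.0 ** (n_db / 10.0)        # estimated noise power
        s_pow = max(y_pow - n_pow, 1e-12)    # subtraction-style speech power estimate
        local_snr_db = 10.0 * math.log10(s_pow / n_pow)
        mask.append(local_snr_db > threshold_db)
    return mask


def norm_cdf(x: float, mean: float, var: float) -> float:
    """CDF of the normal distribution N(mean, var) evaluated at x."""
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0 * var)))


def bounded_marginal_loglik(obs_db: Sequence[float], mask: Sequence[bool],
                            mean: Sequence[float], var: Sequence[float],
                            floor_db: float = -100.0) -> float:
    """Log-likelihood of one diagonal-Gaussian state under bounded marginalization."""
    total = 0.0
    for x, reliable, m, v in zip(obs_db, mask, mean, var):
        if reliable:
            # Reliable channel: evaluate the Gaussian log-density as usual.
            total += -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
        else:
            # Unreliable channel: the clean speech energy is assumed to lie between
            # floor_db and the noisy observation x (approximately an upper bound for
            # additive noise), so integrate the Gaussian over that interval instead.
            mass = norm_cdf(x, m, v) - norm_cdf(floor_db, m, v)
            total += math.log(max(mass, 1e-300))
    return total


if __name__ == "__main__":
    noisy = [30.0, 5.0]   # observed log-energies in dB (made-up values)
    noise = [10.0, 4.0]   # noise estimate in dB (made-up values)
    mask = snr_mask(noisy, noise)
    print(mask)  # [True, False]
    print(bounded_marginal_loglik(noisy, mask, mean=[28.0, 3.0], var=[4.0, 4.0]))
```

In this framework the scoring function stays the same whatever the source of distortion; compensating for reverberation as well as noise amounts to changing which channels the mask labels unreliable.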