Recognition of Noisy Speech: A Comparative Survey of Robust Model Architecture and Feature Enhancement

The performance of speech recognition systems degrades strongly in the presence of background noise, such as the driving noise inside a car. In contrast to existing work, we aim to improve noise robustness at all major levels of speech recognition: feature extraction, feature enhancement, speech modelling, and training. To this end, we give an overview of promising auditory modelling concepts, speech enhancement techniques, training strategies, and model architectures, all of which are implemented in an in-car digit and spelling recognition task with noise produced by various car types and driving conditions. We show that joint speech and noise modelling with a Switching Linear Dynamic Model (SLDM) outperforms speech enhancement techniques such as Histogram Equalisation (HEQ), with a mean relative error reduction of 52.7% over various noise types and levels. For speech disturbed by additive white Gaussian noise, embedding a Switching Linear Dynamical System (SLDS) into a Switching Autoregressive Hidden Markov Model (SAR-HMM) performs best.
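
To make the HEQ baseline named in the abstract concrete, the following is a minimal sketch of histogram equalisation applied to a single feature component over one utterance. It is not the configuration used in this survey: the function name `histogram_equalise`, the choice of a standard normal reference distribution, and the mid-point empirical CDF are all assumptions made for illustration. A real front end would typically estimate the reference distribution from clean training data and process each cepstral component separately.

```python
from statistics import NormalDist


def histogram_equalise(values):
    """Map one feature component onto a standard normal reference.

    Assumption: the reference distribution is N(0, 1); this is an
    illustrative choice, not the setup described in the paper.
    """
    n = len(values)
    if n == 0:
        return []
    # Frame indices sorted by feature value (ties keep original order).
    order = sorted(range(n), key=lambda i: values[i])
    reference = NormalDist(0.0, 1.0)
    equalised = [0.0] * n
    for rank, idx in enumerate(order):
        # Mid-point empirical CDF keeps p strictly inside (0, 1).
        p = (rank + 0.5) / n
        # Replace the value by the reference quantile at the same CDF level.
        equalised[idx] = reference.inv_cdf(p)
    return equalised


if __name__ == "__main__":
    # Hypothetical values of one cepstral coefficient over five frames.
    noisy = [2.1, -0.3, 0.8, 5.6, 1.4]
    print(histogram_equalise(noisy))
```

Because the mapping depends only on the rank of each value, it is monotonic and removes any monotonic distortion of the feature distribution, which is the property HEQ relies on. Unlike joint speech and noise modelling, it does not track the noise over time.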


[36]  Odette Scharenborg,et al.  The interspeech 2008 consonant challenge , 2008, INTERSPEECH.

[37]  Martin Bouchard,et al.  Comb filter decomposition for robust ASR , 2005, INTERSPEECH.

[38]  Hakan Erdogan,et al.  Incremental on-line feature space MLLR adaptation for telephony speech recognition , 2002, INTERSPEECH.

[39]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM - algorithm plus discussions on the paper , 1977 .

[40]  Li Deng,et al.  Recursive estimation of nonstationary noise using iterative stochastic approximation for robust speech recognition , 2003, IEEE Trans. Speech Audio Process..

[41]  Sarel van Vuuren,et al.  Relevance of time-frequency features for phonetic and speaker-channel classification , 2000, Speech Commun..

[42]  Lawrence R. Rabiner,et al.  A tutorial on hidden Markov models and selected applications in speech recognition , 1989, Proc. IEEE.

[43]  Detlev Langmann,et al.  Acoustic front ends for speaker-independent digit recognition in car environments , 1997, EUROSPEECH.

[44]  John N. Tsitsiklis,et al.  Introduction to Probability , 2002 .

[45]  Fernando Pereira,et al.  Shallow Parsing with Conditional Random Fields , 2003, NAACL.

[46]  Yaakov Bar-Shalom,et al.  Estimation and Tracking: Principles, Techniques, and Software , 1993 .

[47]  H. Bourlard,et al.  Unsupervised spectral subtraction for noise-robust ASR , 2005, IEEE Workshop on Automatic Speech Recognition and Understanding, 2005..

[48]  Charles M. Grinstead,et al.  Introduction to probability , 1999, Statistics for the Behavioural Sciences.

[49]  Alex Acero,et al.  Hidden conditional random fields for phone classification , 2005, INTERSPEECH.

[50]  Ephraim Speech enhancement using a minimum mean square error short-time spectral amplitude estimator , 1984 .

[51]  Alex Acero,et al.  Noise robust speech recognition with a switching linear dynamic model , 2004, 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[52]  Guillaume Lathoud Channel Normalization for Unsupervised Spectral Subtraction , 2006 .

[53]  Hynek Hermansky TRAP-TANDEM: data-driven extraction of temporal features from speech , 2003, 2003 IEEE Workshop on Automatic Speech Recognition and Understanding (IEEE Cat. No.03EX721).

[54]  H Hermansky,et al.  Perceptual linear predictive (PLP) analysis of speech. , 1990, The Journal of the Acoustical Society of America.

[55]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[56]  Trevor Darrell,et al.  Hidden Conditional Random Fields for Gesture Recognition , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[57]  Reinhold Häb-Umbach,et al.  Modeling the dynamics of speech and noise for speech feature enhancement in ASR , 2008, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing.

[58]  Hynek Hermansky,et al.  RASTA-PLP speech analysis technique , 1992, [Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[59]  Brian Roark,et al.  Discriminative Language Modeling with Conditional Random Fields and the Perceptron Algorithm , 2004, ACL.

[60]  M.G. Rahim,et al.  Signal conditioning techniques for robust speech recognition , 1996, IEEE Signal Processing Letters.

[61]  Rahul Sarpeshkar,et al.  An FFT-Based Companding Front End for Noise-Robust Automatic Speech Recognition , 2007, EURASIP J. Audio Speech Music. Process..

[62]  Olli Viikki,et al.  Cepstral domain segmental feature vector normalization for noise robust speech recognition , 1998, Speech Commun..

[63]  Björn W. Schuller,et al.  On the Necessity and Feasibility of Detecting a Driver's Emotional State While Driving , 2007, ACII.

[64]  Hynek Hermansky,et al.  Evaluation and optimization of perceptually-based ASR front-end , 1993, IEEE Trans. Speech Audio Process..

[65]  Richard M. Stern,et al.  A vector Taylor series approach for environment-independent speech recognition , 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[66]  Abeer Alwan,et al.  On the use of variable frame rate analysis in speech recognition , 2000, 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100).

[67]  Peter Jancovic,et al.  On the mask modeling and feature representation in the missing-feature ASR: evaluation on the Consonant Challenge , 2008, INTERSPEECH.

[68]  Li Deng,et al.  A comparison of three non-linear observation models for noisy speech features , 2003, INTERSPEECH.

[69]  Richard M. Stern,et al.  Environmental robustness in automatic speech recognition , 1990, International Conference on Acoustics, Speech, and Signal Processing.

[70]  Andrew McCallum,et al.  Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[71]  W. Bruce Croft,et al.  Table extraction using conditional random fields , 2003, DG.O.

[72]  William J. J. Roberts,et al.  Revisiting autoregressive hidden Markov modeling of speech signals , 2005, IEEE Signal Processing Letters.

[73]  Adrião Duarte Dória Neto,et al.  Digit recognition using wavelet and SVM in Brazilian Portuguese , 2008, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing.