Environment mismatch compensation using average eigenspace-based methods for robust speech recognition

The performance of speech recognition systems is adversely affected by mismatch between training and test conditions due to environmental factors. Beyond the common case of test data from noisy environments, there are also scenarios where the training data itself is noisy. In this study, we propose a series of methods for mismatch compensation between training and test environments, based on our “average eigenspace” approach. These methods are also shown to be effective under non-stationary mismatch conditions. An advantage is that no explicit adaptation data is needed, since the method is applied directly to incoming test data to estimate the compensatory transform. We evaluate these approaches on two separate corpora collected in realistic car environments: CU-Move and UTDrive. Compared with a baseline system incorporating spectral subtraction, highpass filtering, and cepstral mean normalization, the proposed techniques yield a relative word error rate reduction of 17–26%. The methods also reduce the dimensionality of the feature vectors, allowing for a more compact set of acoustic models in the phoneme space, a property important for small-footprint mobile devices such as cell phones or PDAs that require ASR in diverse environments.
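To make the idea concrete, the sketch below illustrates one plausible reading of an average eigenspace-based compensation step: eigenspaces derived from the training-environment and test-environment feature covariances are combined, and features are projected onto the leading directions of the combined space, which also performs the dimensionality reduction mentioned above. This is a minimal illustration under our own assumptions (averaging the two covariance matrices before eigendecomposition, and the function name `average_eigenspace_transform`), not the authors' exact algorithm.

```python
import numpy as np


def average_eigenspace_transform(train_feats, test_feats, keep_dims):
    """Estimate a compensatory projection from an 'average eigenspace' (illustrative sketch).

    train_feats, test_feats: (num_frames, feat_dim) arrays of MFCC-like features
    from the training and test environments.
    keep_dims: number of leading eigen-directions to retain (dimensionality reduction).
    """
    # Mean-normalize each feature set (analogous to cepstral mean normalization).
    train_c = train_feats - train_feats.mean(axis=0)
    test_c = test_feats - test_feats.mean(axis=0)

    # Covariance matrices of the two environments.
    cov_train = np.cov(train_c, rowvar=False)
    cov_test = np.cov(test_c, rowvar=False)

    # Assumed simplification: form an "average" eigenspace by eigendecomposing
    # the averaged covariance (a stand-in for jointly diagonalizing both).
    avg_cov = 0.5 * (cov_train + cov_test)
    eigvals, eigvecs = np.linalg.eigh(avg_cov)

    # Keep the eigenvectors with the largest eigenvalues as the projection basis.
    order = np.argsort(eigvals)[::-1]
    return eigvecs[:, order[:keep_dims]]  # shape: (feat_dim, keep_dims)


# Hypothetical usage: project both training and test features into the reduced
# space before acoustic model training / decoding.
# proj = average_eigenspace_transform(train_mfcc, test_mfcc, keep_dims=10)
# train_compensated = (train_mfcc - train_mfcc.mean(axis=0)) @ proj
# test_compensated = (test_mfcc - test_mfcc.mean(axis=0)) @ proj
```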
