Switching Linear Dynamic Models for Recognition of Emotionally Colored and Noisy Speech

Model-based speech feature enhancement techniques have been shown to be a promising approach to increasing the robustness of automatic speech recognition (ASR) in noisy environments. Strategies that model speech with a Switching Linear Dynamic Model (SLDM) have been successfully applied to noisy speech recognition tasks, since they overcome the limitations of GMM- or HMM-based approaches. However, SLDM-based feature enhancement has so far only been investigated for the recognition of isolated words or for relatively friendly scenarios such as connected digit recognition in the presence of additive noise using whole-word models (e.g., the AURORA task). To give an impression of the effectiveness of SLDM speech modeling for more challenging ASR applications, we evaluate SLDM feature enhancement for continuous recognition of spontaneous and emotionally colored speech in noise. As the back end, we use tied-state triphone models trained and evaluated on the SAL Corpus. Applying SLDM-based feature enhancement, we achieve an average relative performance gain of almost 20% across diverse noise settings.
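For readers unfamiliar with the model class, the following is a minimal sketch of how a switching linear dynamic model generates a feature trajectory. It is not the system evaluated here. It assumes scalar features, a first-order Markov chain over the hidden switching state, and the commonly used state-conditioned linear recursion x_t = A(s_t) x_{t-1} + b(s_t) + v_t with Gaussian noise v_t. The function name `sample_sldm` and all parameter values are hypothetical and chosen only for illustration.

```python
import random


def sample_sldm(A, b, var, trans, n_frames, seed=0):
    """Draw a scalar trajectory from a toy SLDM (illustrative assumption).

    For hidden state s_t, the feature evolves as
        x_t = A[s_t] * x_{t-1} + b[s_t] + v_t,   v_t ~ N(0, var[s_t]).
    The state sequence follows a Markov chain with row-stochastic
    transition matrix `trans` (an assumption made for this sketch).
    """
    rng = random.Random(seed)
    n_states = len(A)
    s = rng.randrange(n_states)  # arbitrary initial state
    x = 0.0                      # arbitrary initial feature value
    states, xs = [], []
    for _ in range(n_frames):
        # Pick the next switching state given the current one.
        s = rng.choices(range(n_states), weights=trans[s])[0]
        # Apply the linear dynamics selected by that state.
        x = A[s] * x + b[s] + rng.gauss(0.0, var[s] ** 0.5)
        states.append(s)
        xs.append(x)
    return states, xs


# Hypothetical two-state example: a slowly decaying and a rising regime.
states, xs = sample_sldm(
    A=[0.9, 0.5],
    b=[0.0, 1.0],
    var=[0.01, 0.04],
    trans=[[0.95, 0.05], [0.10, 0.90]],
    n_frames=50,
)
```

Feature enhancement runs this model in the inverse direction: given noisy observations, it estimates the clean trajectory together with the hidden switching states. That inference step is not shown in this sketch.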