Robust automatic speech recognition using acoustic model adaptation prior to missing feature reconstruction

When speech recognition is used in real-world environments, simultaneous speaker and environmental adaptation and compensation for time-varying noise effects is needed. Noise compensation methods like missing feature reconstruction should be combined with adaptation methods like constrained maximum likelihood linear regression (CMLLR). This is only straightforward if reconstruction is used prior to CMLLR. In this work, reconstruction is modified so that we can estimate CMLLR transformations prior to reconstruction. The new approach is evaluated on large vocabulary speech data recorded in noisy public and car environments and compared to using reconstruction prior to CMLLR estimation. The results suggest the noise environment determines which approach is better. Using adaptation prior to reconstruction has the better performance when evaluated on data from public environments. The relative reductions in letter error rate were 47-50 % compared to the baseline and 13-19 % compared to using either adaptation or reconstruction alone.

[1]  Lei He,et al.  Robust Automatic Speech Recognition for Acce , 2006 .

[2]  Hugo Van hamme,et al.  Vector-quantization based mask estimation for missing data automatic speech recognition , 2007, INTERSPEECH.

[3]  Mikko Kurimo,et al.  Missing feature reconstruction and acoustic model adaptation combined for large vocabulary continuous speech recognition , 2008, 2008 16th European Signal Processing Conference.

[4]  Krzysztof Marasek,et al.  SPEECON – Speech Databases for Consumer Devices: Database Specification and Validation , 2002, LREC.

[5]  Phil D. Green,et al.  Robust automatic speech recognition with missing and unreliable acoustic data , 2001, Speech Commun..

[6]  Mikko Kurimo,et al.  Duration modeling techniques for continuous speech recognition , 2004, INTERSPEECH.

[7]  Janne Pylkkönen AN EFFICIENT ONE-PASS DECODER FOR FINNISH LARGE VOCABULARY CONTINUOUS SPEECH RECOGNITION , .

[8]  Mikko Kurimo,et al.  Unlimited vocabulary speech recognition with morph language models applied to Finnish , 2006, Comput. Speech Lang..

[9]  Mark J. F. Gales,et al.  Semi-tied covariance matrices for hidden Markov models , 1999, IEEE Trans. Speech Audio Process..

[10]  Richard M. Stern,et al.  Reconstruction of missing features for robust speech recognition , 2004, Speech Commun..

[11]  Jj Odell,et al.  The Use of Context in Large Vocabulary Speech Recognition , 1995 .

[12]  Vesa Siivola,et al.  Growing an n-gram language model , 2005, INTERSPEECH.

[13]  Mark J. F. Gales,et al.  Maximum likelihood linear transformations for HMM-based speech recognition , 1998, Comput. Speech Lang..