Front-End Compensation Methods for LVCSR Under Lombard Effect

This study analyzes the impact of background noise variations and the Lombard effect (LE) on large vocabulary continuous speech recognition (LVCSR). The robustness of several front-end feature extraction strategies, combined with state-of-the-art feature distribution normalizations, is tested on neutral and Lombard speech from the UT-Scope database presented in two types of background noise at various SNR levels. An extension of a bottleneck (BN) front-end that normalizes both critical band energies (CRBE) and BN outputs is proposed and shown to perform competitively with the best MFCC-based system. A novel MFCC-based BN front-end is introduced and shown to outperform all other systems in all conditions considered (average 4.1% absolute WER reduction over the second-best system). Additionally, two phenomena are observed: (i) combining cepstral mean subtraction with the recently established RASTALP filtering significantly reduces transient effects of RASTA band-pass filtering and increases ASR robustness to noise and LE; (ii) histogram equalization may benefit from using reference distributions derived from pre-normalized rather than raw training features, and also from adopting distributions from different front-ends.

Index Terms: speech recognition, Lombard effect, UT-Scope database, bottleneck features, quantile-based cepstral distribution normalization, histogram equalization.
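To make two of the normalizations named above concrete, the following is a minimal sketch of cepstral mean subtraction followed by quantile-based histogram equalization against a reference distribution taken from pre-normalized training features, as in observation (ii). It is an illustration under stated assumptions, not the authors' implementation: the function names, the synthetic 13-dimensional features, the choice of 100 quantiles, and treating the training set as a single block for mean subtraction are all hypothetical.

```python
import numpy as np


def cepstral_mean_subtraction(feats):
    """Subtract the mean of each cepstral dimension over the frames (rows)."""
    return feats - feats.mean(axis=0, keepdims=True)


def histogram_equalization(feats, reference, n_quantiles=100):
    """Map each dimension of `feats` onto the distribution of `reference`
    by matching quantiles (piecewise-linear interpolation between them)."""
    probs = np.linspace(0.0, 1.0, n_quantiles)
    out = np.empty(feats.shape, dtype=float)
    for d in range(feats.shape[1]):
        src_q = np.quantile(feats[:, d], probs)      # quantiles of the input
        ref_q = np.quantile(reference[:, d], probs)  # quantiles of the reference
        out[:, d] = np.interp(feats[:, d], src_q, ref_q)
    return out


# Synthetic stand-ins for real cepstral features (assumption: 13 coefficients).
rng = np.random.default_rng(0)
train = rng.normal(size=(5000, 13))
test_utt = 2.0 * rng.normal(size=(300, 13)) + 3.0  # shifted and scaled "mismatched" data

# Reference distribution taken from pre-normalized (mean-subtracted) training features.
reference = cepstral_mean_subtraction(train)

normalized = cepstral_mean_subtraction(test_utt)
equalized = histogram_equalization(normalized, reference)
```

After equalization, each dimension of `equalized` follows approximately the same distribution as the corresponding dimension of `reference`, regardless of the shift and scale applied to the synthetic test utterance.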
