A comparison of front-end compensation strategies for robust LVCSR under room reverberation and increased vocal effort

Automatic speech recognition performance is known to deteriorate in the presence of room reverberation and variations in speakers' vocal effort. This study examines the robustness of several state-of-the-art front-end feature extraction and normalization strategies to these sources of speech signal variability in the context of large vocabulary continuous speech recognition (LVCSR). A speech database recorded in an anechoic room, capturing modal speech and speech produced at different levels of vocal effort, is reverberated using measured room impulse responses and used in the evaluations. The combination of the recently introduced mean Hilbert envelope coefficients (MHEC) and a normalization strategy combining cepstral gain normalization and modified RASTA filtering (CGN_RASTALP) is shown to provide considerable recognition performance gains for reverberant modal and high vocal effort speech.
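The reverberation step described above, in which anechoic recordings are passed through room impulse responses, amounts to a convolution. The following is a minimal sketch under stated assumptions and not the study's actual pipeline: the function name `reverberate`, the peak normalization, the 16 kHz sampling rate, and the synthetic noise signal and exponentially decaying impulse response are all hypothetical stand-ins for the recorded speech and measured responses.

```python
import numpy as np


def reverberate(speech, rir):
    """Convolve a dry signal with a room impulse response (RIR).

    The output is truncated to the dry signal's length and peak-normalized.
    Both choices are assumptions made for this sketch.
    """
    wet = np.convolve(speech, rir, mode="full")[: len(speech)]
    peak = np.max(np.abs(wet))
    return wet / peak if peak > 0 else wet


# Hypothetical stand-ins: white noise instead of anechoic speech, and a
# synthetic decaying-noise RIR instead of a measured one.
rng = np.random.default_rng(0)
fs = 16000                                  # assumed sampling rate in Hz
dry = rng.standard_normal(fs)               # one second of "speech"
t = np.arange(int(0.3 * fs)) / fs           # 300 ms impulse response
rir = rng.standard_normal(t.size) * np.exp(-t / 0.05)
rir[0] = 1.0                                # direct-path component

wet = reverberate(dry, rir)
```

With a measured impulse response, `rir` would be loaded from the measurement file and resampled to the speech sampling rate before the convolution.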
