Two-stage system for robust neutral/Lombard speech recognition

The performance of current speech recognition systems deteriorates significantly in strongly noisy environments. The degradation can be attributed to background noise and to the Lombard effect (LE). Attempts at LE-robust systems often display a tradeoff between LE-specific improvements and portability to neutral speech. Therefore, for LE-robust recognition, it seems more effective to use a set of condition-dedicated subsystems driven by a condition classifier than to attempt one universal recognizer. This paper focuses on the design of a two-stage recognition system (TSR) comprising a talking style classifier (neutral/LE) followed by two style-dedicated recognizers differing in input features. First, the binary neutral/LE classifier is built, with particular interest in developing suitable features for the classification. Second, the performance of common speech features (MFCC, PLP), LE-robust features (Expolog), and newly proposed features is compared in neutral/LE digit recognition tasks. In addition, robustness to changes in average speech pitch and to various noise backgrounds is evaluated. Third, the TSR is built, employing two recognizers, each using style-specific features. Comparison of the proposed system with either a neutral-specific or an LE-specific recognizer on joint neutral/LE speech shows an improvement of 6.5 → 4.2 % WER on neutral and 48.1 → 28.4 % WER on LE Czech utterances.
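The two-stage arrangement described above (a style classifier that routes each utterance to a style-dedicated recognizer with its own features) can be sketched roughly as follows. This is a minimal illustration only: the class name `TwoStageRecognizer`, the level-threshold `toy_classifier`, the `toy_features` extractor, and the string-returning decoders are all hypothetical placeholders, not the classifier, features, or recognizers used in the paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Signal = List[float]
FeatureExtractor = Callable[[Signal], List[float]]
Decoder = Callable[[List[float]], str]


@dataclass
class TwoStageRecognizer:
    """Stage 1 picks a talking style; stage 2 runs the recognizer for that style."""
    classify_style: Callable[[Signal], str]  # expected to return "neutral" or "LE"
    recognizers: Dict[str, Tuple[FeatureExtractor, Decoder]]

    def recognize(self, signal: Signal) -> Tuple[str, str]:
        style = self.classify_style(signal)
        # Each style has its own feature extractor and decoder.
        extract, decode = self.recognizers[style]
        return style, decode(extract(signal))


# Hypothetical stand-ins so the sketch runs; none of these come from the paper.
def toy_classifier(signal: Signal) -> str:
    # Placeholder rule: treat a high mean absolute level as Lombard speech.
    mean_level = sum(abs(x) for x in signal) / len(signal)
    return "LE" if mean_level > 0.5 else "neutral"


def toy_features(signal: Signal) -> List[float]:
    return [abs(x) for x in signal]


tsr = TwoStageRecognizer(
    classify_style=toy_classifier,
    recognizers={
        "neutral": (toy_features, lambda feats: "decoded by neutral recognizer"),
        "LE": (toy_features, lambda feats: "decoded by LE recognizer"),
    },
)

print(tsr.recognize([0.1, -0.2, 0.1]))
print(tsr.recognize([0.9, -0.8, 0.7]))
```

In a real system the two dictionary entries would hold different feature extractors (for example, a common front end for neutral speech and an LE-robust one for Lombard speech), which is the point of routing by style before decoding.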
