Analysis of the visual Lombard effect and automatic recognition experiments

This study focuses on automatic visual speech recognition in the presence of noise. The authors show that when speech is produced in noisy environments, the Lombard effect causes articulatory changes that are both audible and visible. They analyze the visual Lombard effect and its role in automatic visual and audiovisual speech recognition. Experimental results on both English and Japanese data demonstrate that the Lombard effect degrades recognition in the visual speech domain: a lip-reading system designed without taking it into account performs worse. This matters particularly for automatic audiovisual speech recognition in real noisy environments, where recognition rates decrease both because of the acoustic noise itself and because of the Lombard effect. The authors also show that the performance of an audiovisual speech recognizer depends on the visual Lombard effect and can be further improved when this effect is considered in the system design.
