A new posterior-based audio-visual integration method for robust speech recognition

We describe the development of a multistream HMM-based audio-visual speech recognition (AVSR) system and a new method for integrating the audio and visual streams using frame-level posterior probabilities. We compare the new method with the standard feature concatenation and weighted product methods in speaker-dependent tests on our own multimodal database, examining how robust recognition remains when either stream is corrupted. The audio stream is corrupted with additive noise at different SNR levels; the video stream is corrupted with MPEG-4 compression at different bitrates and with image blurring using Gaussian filters. The results are very promising and demonstrate the robustness of the new method.
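To make the baseline and the audio corruption concrete, the following is a minimal Python sketch, not the system described here. It illustrates the standard weighted product combination of per-frame class posteriors (one of the comparison methods, not the new integration method, whose details the abstract does not give) and the scaling of additive noise to a target SNR. The function names, the `audio_weight` parameter, the example posteriors, and the choice of white Gaussian noise are all assumptions made for illustration.

```python
import math
import random


def weighted_product_fusion(audio_post, video_post, audio_weight):
    """Fuse one frame's class posteriors as p_audio^w * p_video^(1-w), renormalized.

    Hypothetical helper: audio_weight (w) in [0, 1] is the audio stream weight.
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    eps = 1e-12  # floor so that log(0) never occurs
    # Work in the log domain: the weighted product becomes a weighted sum.
    log_scores = [
        audio_weight * math.log(max(pa, eps))
        + (1.0 - audio_weight) * math.log(max(pv, eps))
        for pa, pv in zip(audio_post, video_post)
    ]
    peak = max(log_scores)  # subtract the maximum for numerical stability
    unnormalized = [math.exp(s - peak) for s in log_scores]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def add_noise_at_snr(signal, snr_db, rng):
    """Add white Gaussian noise (an assumption) scaled to the requested SNR in dB."""
    signal_power = sum(x * x for x in signal) / len(signal)
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    noise_power = sum(n * n for n in noise) / len(noise)
    # Choose the scale so that signal_power / (scale^2 * noise_power) = 10^(snr_db / 10).
    scale = math.sqrt(signal_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return [x + scale * n for x, n in zip(signal, noise)]


# Illustrative values only: three classes, with the two streams disagreeing.
audio_frame = [0.7, 0.2, 0.1]
video_frame = [0.1, 0.8, 0.1]
fused = weighted_product_fusion(audio_frame, video_frame, audio_weight=0.5)

# A 440 Hz tone sampled at 8 kHz, corrupted at 10 dB SNR.
clean = [math.sin(2.0 * math.pi * 440.0 * t / 8000.0) for t in range(800)]
noisy = add_noise_at_snr(clean, snr_db=10.0, rng=random.Random(0))
```

With `audio_weight` set to 1.0 the fused posteriors reduce to the audio posteriors, and with 0.0 to the video posteriors; intermediate values trade off the two streams, which is why the weight is usually tied to an estimate of stream reliability.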
