Audio-Visual Speech Recognition Using Motion-Based Lipreading

This paper presents an audio-visual, speaker-dependent continuous speech recognition system. The idea is to extract features from the audio and the video stream of a speaking person separately and to train a Hidden Markov Model (HMM) based recognizer with the combined feature vectors. While the audio feature extraction follows a classical approach, the visual features are obtained by an advanced image processing algorithm that tracks certain regions on the speaker's lips with high robustness and accuracy. For a self-generated audio-visual database, we compare the recognition rates of audio-only, video-only, and audio-visual recognition systems. We also compare the results of the audio-only and audio-visual systems under different noise conditions. The work is part of a larger project aiming at a new man-machine interface in the form of a so-called Virtual Personal Assistant, which communicates with the user based on the multimodal integration of natural communication channels.
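The fusion step described above, concatenating per-frame audio and visual features into one observation vector and training an HMM recognizer on the result, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes MFCC-style vectors for the "classical" audio features, uses hmmlearn's GaussianHMM as the recognizer with one model per word class, and substitutes synthetic arrays for the output of the lip-tracking algorithm, which is not reproduced here.

```python
# Minimal sketch of feature-level audio-visual fusion with an HMM recognizer.
# Assumptions (not from the paper): MFCC-like audio features, hmmlearn's
# GaussianHMM as the model, and synthetic stand-ins for the visual features
# that the paper's lip-tracking algorithm would produce.
import numpy as np
from hmmlearn import hmm

RNG = np.random.default_rng(0)

def audio_features(n_frames, n_mfcc=13):
    """Placeholder for classical audio feature extraction (e.g. MFCCs,
    which in practice could come from librosa.feature.mfcc)."""
    return RNG.normal(size=(n_frames, n_mfcc))

def visual_features(n_frames, n_vis=6):
    """Placeholder for the lip-tracking output: one vector of geometric
    measurements of the tracked lip regions per video frame."""
    return RNG.normal(size=(n_frames, n_vis))

def fuse(audio, visual):
    """Feature-level fusion: align the (slower) video frame rate to the
    audio frame rate by naive index resampling, then concatenate."""
    idx = np.linspace(0, len(visual) - 1, num=len(audio)).astype(int)
    return np.hstack([audio, visual[idx]])

def train_class_model(utterances, n_states=5):
    """Train one HMM per word class on fused feature sequences
    (speaker-dependent setting, as in the paper)."""
    X = np.vstack(utterances)
    lengths = [len(u) for u in utterances]
    model = hmm.GaussianHMM(n_components=n_states,
                            covariance_type="diag", n_iter=20)
    model.fit(X, lengths)
    return model

def recognize(models, observation):
    """Pick the class whose HMM assigns the highest log-likelihood."""
    return max(models, key=lambda label: models[label].score(observation))

if __name__ == "__main__":
    # Two toy word classes with five synthetic training utterances each;
    # 40 audio frames paired with 25 video frames per utterance.
    train = {label: [fuse(audio_features(40), visual_features(25))
                     for _ in range(5)]
             for label in ("yes", "no")}
    models = {label: train_class_model(utts) for label, utts in train.items()}
    test = fuse(audio_features(40), visual_features(25))
    print("recognized:", recognize(models, test))
```

A real system would replace the two placeholder extractors with the actual MFCC front end and lip-region tracker, and would likely weight or normalize the two feature streams before concatenation so that neither modality dominates the Gaussian emission densities.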
