This paper presents an audio-visual, speaker-dependent continuous speech recognition system. The idea is to extract features from the audio and the video stream of a speaking person separately and to train a Hidden Markov Model (HMM) based recognizer on the combined feature vectors. While the audio feature extraction follows a classical approach, the visual features are obtained by an advanced image-processing algorithm that tracks certain regions on the speaker's lips with high robustness and accuracy. On a self-generated audio-visual database, we compare the recognition rates of audio-only, video-only, and audio-visual recognition systems, and we further compare the audio-only and audio-visual systems under different noise conditions. The work is part of a larger project aiming at a new man-machine interface in the form of a so-called Virtual Personal Assistant, which communicates with the user through the multimodal integration of natural communication channels.
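The feature-fusion idea can be illustrated with a minimal sketch. The paper does not specify its audio front end or toolkit, so the following assumes MFCCs as the "classical" audio features, a pre-extracted array of tracked lip-region parameters as the visual stream, and hmmlearn's GaussianHMM standing in for the recognizer; the file names and the dimensionality of the visual features are hypothetical placeholders.

```python
# Hedged sketch of audio-visual feature fusion for HMM-based recognition.
# Assumptions (not from the paper): MFCC audio features, pre-extracted
# lip-region parameters as visual features, hmmlearn as the back end.
import numpy as np
import librosa
from hmmlearn import hmm

def audio_features(wav_path, fps=25):
    """13 MFCCs per video frame; the hop length matches the video rate."""
    y, sr = librosa.load(wav_path, sr=16000)
    hop = sr // fps                       # one audio frame per video frame
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop)
    return mfcc.T                         # shape: (n_frames, 13)

def fuse(audio_feats, visual_feats):
    """Concatenate both streams into one joint feature vector per frame."""
    n = min(len(audio_feats), len(visual_feats))   # align stream lengths
    return np.hstack([audio_feats[:n], visual_feats[:n]])

# Hypothetical inputs: an utterance recording and an (n_frames, d) array
# of lip-region parameters from the visual tracking front end.
audio_feats = audio_features("utterance.wav")
visual_feats = np.load("lip_features.npy")

X = fuse(audio_feats, visual_feats)
model = hmm.GaussianHMM(n_components=5, covariance_type="diag", n_iter=20)
model.fit(X)                  # in practice, one model per recognition unit
print("log-likelihood:", model.score(X))
```

In a continuous-speech setting one such model would be trained per word or phone and decoding would search over model sequences; the sketch only shows the joint feature construction that distinguishes the audio-visual system from its audio-only counterpart.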