A real-time automatic lipreading system

It's well known that visual information such as lip shape and its movement can indicate what the speaker is talking about. In this paper, we present an automatic lipreading system solely using visual information for recognizing isolated English digits from 0 to 9. A parameter set of a 14-point ASM lip model is used to describe the outer lip contour. The inner mouth information such as the teeth region and the mouth opening are also extracted. With appropriate normalization, the feature vectors containing the normalized outer lip features, inner mouth features and also their first order derivatives are obtained for training the HMM word models. Experiments have been carried out to investigate the recognition performance using our visual feature set compared with other traditional visual feature representations. An accuracy of 93% for speaker dependent recognition and 84% for speaker independent recognition is achieved using our visual feature representation. A real-time automatic lipreading system has been successfully implemented on a 1.9-GHz PC.

[1]  Alan Wee-Chung Liew,et al.  A new optimization procedure for extracting the point-based lip contour using active shape model , 2001, 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221).

[2]  Lorenzo Torresani,et al.  2D Deformable Models for Visual Speech Analysis , 1996 .

[3]  Michael T. Chan,et al.  HMM-based audio-visual speech recognition integrating geometric- and appearance-based visual features , 2001, 2001 IEEE Fourth Workshop on Multimedia Signal Processing (Cat. No.01TH8564).

[4]  Juergen Luettin,et al.  Active Shape Models for Visual Speech Feature Extraction , 1996 .

[5]  Alan Wee-Chung Liew,et al.  Lip contour extraction from color images using a deformable model , 2002, Pattern Recognit..

[6]  L. Venkata Subramaniam,et al.  Large vocabulary audio-visual speech recognition using active shape models , 2000, Proceedings 15th International Conference on Pattern Recognition. ICPR-2000.

[7]  Biing-Hwang Juang,et al.  Fundamentals of speech recognition , 1993, Prentice Hall signal processing series.

[8]  Timothy F. Cootes,et al.  Extraction of Visual Features for Lipreading , 2002, IEEE Trans. Pattern Anal. Mach. Intell..

[9]  Gihad Rabi,et al.  Visual speech recognition by recurrent neural networks , 1997, CCECE '97. Canadian Conference on Electrical and Computer Engineering. Engineering Innovation: Voyage of Discovery. Conference Proceedings.

[10]  Cuntai Guan,et al.  Multi-model approach for noisy speech recognition , 1998 .

[11]  Shu Hung Leung,et al.  Lip image segmentation using fuzzy clustering incorporating an elliptic shape function , 2004, IEEE Transactions on Image Processing.

[12]  Kuntal Sengupta,et al.  Audio-visual modeling for bimodal speech recognition , 2001, 2001 IEEE International Conference on Systems, Man and Cybernetics. e-Systems and e-Man for Cybernetics in Cyberspace (Cat.No.01CH37236).