Lip-Reading Technique Using Spatio-Temporal Templates and Support Vector Machines

This paper presents a lip-reading technique that uses support vector machines to identify silently uttered phones. The proposed system temporally integrates the video data to generate spatio-temporal templates (STT), and 64 Zernike moments (ZM) are extracted from each STT. A novel feature selection algorithm reduces the 64 ZM to 12 features, using the shape of the probability curve as a goodness measure for optimal feature selection. The resulting feature vectors are classified using non-linear support vector machines. Such a system could be invaluable when it is important to communicate without making a sound, such as when giving passwords in public spaces.
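The temporal integration step can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so the version below is an assumption: it thresholds the differences between consecutive grayscale frames and weights them with a linear ramp, in the style of a motion history image, so that more recent mouth movement appears brighter in the single output image. The function name `spatio_temporal_template` and the `threshold` parameter are hypothetical and chosen only for illustration.

```python
import numpy as np


def spatio_temporal_template(frames, threshold=0.1):
    """Collapse a (T, H, W) grayscale clip into one (H, W) template.

    Assumed formulation (not taken from the paper): thresholded frame
    differences combined with a linear temporal ramp.
    """
    frames = np.asarray(frames, dtype=np.float64)
    if frames.ndim != 3 or frames.shape[0] < 2:
        raise ValueError("expected at least two frames shaped (T, H, W)")

    # Binary motion masks from absolute differences of consecutive frames.
    motion = np.abs(np.diff(frames, axis=0)) > threshold  # shape (T-1, H, W)

    # Linear ramp: the most recent difference image gets weight 1.0.
    n = motion.shape[0]
    weights = np.arange(1, n + 1, dtype=np.float64) / n

    # Per pixel, keep the weight of the latest frame in which motion occurred.
    return (motion * weights[:, None, None]).max(axis=0)


if __name__ == "__main__":
    clip = np.zeros((3, 2, 2))
    clip[1:, 0, 0] = 1.0  # pixel (0, 0) changes between frames 0 and 1
    clip[2, 1, 1] = 1.0   # pixel (1, 1) changes between frames 1 and 2
    print(spatio_temporal_template(clip))
```

In a full pipeline of the kind the abstract describes, each such template would then be reduced to Zernike moment features and passed to a non-linear SVM; those stages are omitted here because their details are not given in the abstract.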
