Visual words for lip-reading

This paper investigates the automatic lip-reading problem and proposes a new approach to visual speech recognition (VSR). The approach relies on the signature of the word itself, which is obtained from a hybrid feature extraction method that combines geometric, appearance, and image-transform features. The proposed VSR approach is termed "visual words". It consists of two main parts: 1) feature extraction/selection, and 2) visual speech feature recognition. After the face and lips are localized, several visual features of the lips are extracted: the height and width of the mouth; the mutual information and the quality measurement between the discrete wavelet transform (DWT) of the current region of interest (ROI) and the DWT of the previous ROI; the ratio of vertical to horizontal features taken from the DWT of the ROI; the ratio of vertical edges to horizontal edges of the ROI; the appearance of the tongue; and the appearance of the teeth. Each spoken word is thus represented by 8 signals, one for each feature. These signals preserve the dynamics of the spoken word, which carry a good portion of its information. The system is then trained on these features using k-nearest neighbors (KNN) and dynamic time warping (DTW). The approach was evaluated on a large database of different speakers and with large sets of experiments. The evaluation demonstrated the effectiveness of visual words and showed that VSR is a speaker-dependent problem.
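The recognition step can be illustrated with a minimal Python sketch of nearest-neighbor classification under a DTW distance. It is not the authors' implementation: the function names, the absolute-difference local cost, the equal-weight sum over feature signals, and the toy data are all assumptions made for illustration. Each word is assumed to be a list of per-feature signals (8 in the approach described above), each signal being a list of numbers with one value per video frame.

```python
from collections import Counter
from math import inf


def dtw_distance(a, b):
    """Classic DTW distance between two 1-D sequences.

    Assumption: the local cost is the absolute difference of the samples.
    """
    n, m = len(a), len(b)
    prev = [inf] * (m + 1)
    prev[0] = 0.0  # aligning two empty prefixes costs nothing
    for i in range(1, n + 1):
        curr = [inf] * (m + 1)
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of: insertion, deletion, match
            curr[j] = cost + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return prev[m]


def word_distance(word_a, word_b):
    """Distance between two words, each a list of feature signals.

    Assumption: per-feature DTW distances are summed with equal weight.
    """
    return sum(dtw_distance(sa, sb) for sa, sb in zip(word_a, word_b))


def knn_classify(query, training, k=1):
    """Label a query word by majority vote among its k nearest training words.

    `training` is a list of (label, word) pairs.
    """
    ranked = sorted(training, key=lambda item: word_distance(query, item[1]))
    votes = Counter(label for label, _ in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy usage with a single hypothetical feature signal per word.
training = [
    ("yes", [[0, 1, 2, 1, 0]]),
    ("no", [[5, 5, 5]]),
]
query = [[0, 1, 1, 2, 1, 0]]  # same shape as "yes", spoken more slowly
print(knn_classify(query, training, k=1))
```

DTW is a natural fit here because the same word can span a different number of frames each time it is spoken; the warping absorbs that difference in speaking rate before the nearest-neighbor comparison is made.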
