Real-time lip reading system for isolated Korean word recognition

This paper proposes a real-time lip reading system, consisting of a lip detector, a lip tracker, a lip activation detector, and a word classifier, that can recognize isolated Korean words. Lip detection is performed in several stages: face detection, eye detection, mouth detection, mouth end-point detection, and active appearance model (AAM) fitting. Lip tracking is then performed with a novel two-stage method, in which a model-based Lucas-Kanade feature tracker tracks the outer lip and a fast block matching algorithm tracks the inner lip. Lip activation is detected by a neural network classifier whose input is a combination of the lip motion energy function and the first dominant shape feature. In the final step, input words are defined and recognized by three different classifiers: HMM, ANN, and K-NN. We combine the proposed lip reading system with an audio-only automatic speech recognition (ASR) system to improve word recognition performance in noisy environments, and demonstrate the potential applicability of the combined system to hands-free in-vehicle navigation devices. Experiments on 30 isolated Korean words using the K-NN classifier at 15 fps show that the proposed lip reading system achieves a 92.67% word correct rate (WCR) in person-dependent tests and a 46.50% WCR in person-independent tests. The combined audio-visual ASR system also increases the WCR from 0% to 60% in a noisy environment.
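For illustration, the sketch below strings together the main stages of such a pipeline from off-the-shelf components: OpenCV Haar cascades stand in for the paper's face and mouth detectors, pyramidal Lucas-Kanade optical flow stands in for the model-based outer-lip tracker, and a plain K-NN classifier over resampled lip-shape features stands in for the word classifier. The AAM fitting, inner-lip block matching, and lip activation detector are omitted; all function names, cascade files, and the 20-frame resampling length are assumptions made for this sketch, not the authors' implementation.

```python
# Minimal sketch of a detection -> tracking -> classification lip reading pipeline.
# Substitutions and parameters here are illustrative, not the paper's implementation.
import cv2
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

FACE_CASCADE = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
MOUTH_CASCADE = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_smile.xml")  # stand-in mouth detector

def detect_mouth(gray):
    """Coarse face -> mouth detection; returns the mouth ROI (x, y, w, h) or None."""
    faces = FACE_CASCADE.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])
    lower = gray[y + h // 2:y + h, x:x + w]              # search the lower half of the face
    mouths = MOUTH_CASCADE.detectMultiScale(lower, scaleFactor=1.1, minNeighbors=10)
    if len(mouths) == 0:
        return None
    mx, my, mw, mh = max(mouths, key=lambda m: m[2] * m[3])
    return (x + mx, y + h // 2 + my, mw, mh)

def track_lip_points(prev_gray, gray, prev_pts):
    """Track outer-lip landmarks with pyramidal Lucas-Kanade optical flow."""
    p0 = prev_pts.reshape(-1, 1, 2).astype(np.float32)
    next_pts, status, _ = cv2.calcOpticalFlowPyrLK(
        prev_gray, gray, p0, None, winSize=(15, 15), maxLevel=2)
    return next_pts[status.ravel() == 1].reshape(-1, 2)

def shape_features(pts):
    """Crude per-frame shape descriptor: mouth width, height, and their ratio."""
    w = pts[:, 0].max() - pts[:, 0].min()
    h = pts[:, 1].max() - pts[:, 1].min()
    return np.array([w, h, h / (w + 1e-6)])

def word_features(frame_feats, n_frames=20):
    """Resample a variable-length utterance to a fixed-length vector for K-NN."""
    feats = np.asarray(frame_feats)
    idx = np.linspace(0, len(feats) - 1, n_frames).astype(int)
    return feats[idx].ravel()

# Training/recognition: X is a list of per-utterance feature vectors, y the word labels.
# knn = KNeighborsClassifier(n_neighbors=3).fit(np.vstack(X), y)
# predicted_word = knn.predict(word_features(frame_feats)[None, :])
```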
