Accurate, real-time, unadorned lip tracking

Human speech is inherently multi-modal, consisting of both audio and visual components. Researchers have recently shown that incorporating information about the position of the lips into acoustic speech recognisers enables robust recognition of noisy speech. In the case of Hidden Markov Model recognition, we show that this happens because the visual signal stabilises the alignment of states. It is also shown that unadorned lips, both the inner and outer contours, can be robustly tracked in real time on general-purpose workstations. To accomplish this, efficient algorithms are employed with three key components: shape models, motion models, and focused colour feature detectors, all of which are learnt from examples.
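To illustrate the third component, a colour feature detector learnt from examples, here is a minimal sketch in Python. It is not the paper's actual detector (the paper uses focused detectors evaluated along contour normals); it simply fits Gaussian colour models to labelled lip and skin pixel samples and classifies new pixels by comparing likelihoods. All function names and the choice of RGB space are illustrative assumptions.

```python
import numpy as np

def fit_gaussian(samples):
    """Fit a multivariate Gaussian to labelled colour samples (N x 3 array).

    A small ridge term keeps the covariance invertible for tight clusters.
    """
    mean = samples.mean(axis=0)
    cov = np.cov(samples, rowvar=False) + 1e-6 * np.eye(samples.shape[1])
    return mean, cov

def log_likelihood(pixels, mean, cov):
    """Log-density of each pixel (row of an N x 3 array) under the model."""
    diff = pixels - mean
    inv = np.linalg.inv(cov)
    # Mahalanobis distance for each row: diff_i^T inv diff_i
    mahal = np.einsum('ij,jk,ik->i', diff, inv, diff)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (mahal + logdet + pixels.shape[1] * np.log(2 * np.pi))

def classify_lip(pixels, lip_model, skin_model):
    """Label a pixel True ('lip') when the lip colour model explains it
    better than the skin colour model (equal priors assumed)."""
    return log_likelihood(pixels, *lip_model) > log_likelihood(pixels, *skin_model)
```

In use, one would hand-label a few hundred lip and skin pixels from training frames, fit one model to each class, and then evaluate `classify_lip` only along the normals of the current contour estimate, which is what makes the detector cheap enough for real-time tracking.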
