Lip localization and performance evaluation

Although impressive achievements have been made in the field of lip localization, there is no clear picture of which algorithms or methods hold technical advantages over others. This is partly because a commonly agreed evaluation methodology has not yet been established in the field. The common practice is to use manual lip labelling by human labellers as the ground truth for evaluation. Our empirical results using the Expectation-Maximization procedure show that human labellers introduce subjective noise when labelling lip locations, and that this noise is often too significant for the labels to be used directly as ground truth for evaluation. We argue in this paper that, to train and evaluate a lip analysis system, one has to simultaneously measure the quality of the human operators and infer the "ground truth" from their manual labelling. We demonstrate on the BioID database how the Expectation-Maximization technique can be used for this purpose.
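To make the idea concrete, the sketch below shows one STAPLE-style EM loop of the kind the abstract alludes to: given binary lip masks from several labellers, it alternates between inferring a soft ground-truth mask (E-step) and re-estimating each labeller's sensitivity and specificity against it (M-step). This is a minimal illustration under assumed binary-mask labelling, not the authors' exact formulation; all names and the synthetic data are illustrative.

```python
import numpy as np

def staple_em(labels, prior=0.5, n_iter=50, eps=1e-9):
    """Infer a soft ground-truth mask and per-labeller quality by EM.

    labels: (R, N) binary array, R labellers over N pixels.
    Returns (truth_prob, sensitivity, specificity).
    """
    R, N = labels.shape
    p = np.full(R, 0.9)  # initial sensitivity, P(label=1 | truth=1)
    q = np.full(R, 0.9)  # initial specificity, P(label=0 | truth=0)
    for _ in range(n_iter):
        # E-step: posterior probability that each pixel is truly "lip",
        # combining every labeller's vote weighted by their quality.
        log_a = np.log(prior + eps) + (
            labels * np.log(p[:, None] + eps)
            + (1 - labels) * np.log(1 - p[:, None] + eps)
        ).sum(axis=0)
        log_b = np.log(1 - prior + eps) + (
            labels * np.log(1 - q[:, None] + eps)
            + (1 - labels) * np.log(q[:, None] + eps)
        ).sum(axis=0)
        w = 1.0 / (1.0 + np.exp(log_b - log_a))  # P(truth=1 | labels)
        # M-step: re-estimate each labeller's quality against the
        # current soft ground truth.
        p = (labels * w).sum(axis=1) / (w.sum() + eps)
        q = ((1 - labels) * (1 - w)).sum(axis=1) / ((1 - w).sum() + eps)
    return w, p, q

# Hypothetical usage: three labellers of differing quality annotate a
# hidden mask; EM recovers both the mask and their reliabilities.
rng = np.random.default_rng(0)
truth = (rng.random(1000) < 0.3).astype(int)
quality = np.array([0.95, 0.90, 0.70])
labels = np.array([np.where(rng.random(1000) < s, truth, 1 - truth)
                   for s in quality])
w, sens, spec = staple_em(labels)
print(np.round(sens, 2), np.round(spec, 2))
```

Note that the inferred "ground truth" `w` is soft: a pixel that good labellers disagree on stays near 0.5 rather than being forced to a hard decision, which is exactly the property that makes such estimates usable for evaluating automatic lip localizers.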
