Detection of Mouth Movements and its Applications to Cross-Modal Analysis of Planning Meetings

Detecting meaningful meeting events is important for cross-modal analysis of planning meetings, and many such events are tied to a speaker's communication behavior. Audio-visual speaker detection requires mouth positions and movements as visual input. We present techniques for detecting the mouth positions and movements of a talking person in meetings. First, we model skin color with a Gaussian distribution; after training on skin color samples, we obtain the model parameters and derive a thresholded skin color filter, which we use to detect the face regions of all meeting participants. Second, we create a mouth template and perform image matching to find mouth candidates in each face region. Next, exploiting the fact that lip color differs from the rest of the face, we select the mouth area from among the candidates by comparing each candidate's color dissimilarity to the original skin color model. Finally, we detect mouth movements by computing normalized cross-correlation coefficients of the mouth area between successive frames. A real-time system has been implemented that tracks a speaker's mouth position and detects mouth movements. Applications also include video conferencing and improved human-computer interaction (HCI). Examples from meeting environments and other settings are provided.
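The abstract's first step, a Gaussian skin-color model with a thresholded filter, can be sketched as follows. The paper does not specify its color space or threshold, so this sketch assumes normalized (r, g) chrominance and an illustrative Mahalanobis-distance cutoff; the function names are hypothetical.

```python
import numpy as np

def fit_skin_model(samples):
    """Fit a Gaussian skin-color model in normalized (r, g) chrominance space.

    samples: (N, 3) array of RGB skin pixels collected during training.
    Returns the model mean and inverse covariance (the trained parameters).
    """
    s = samples.astype(float)
    total = s.sum(axis=1, keepdims=True) + 1e-8
    rg = s[:, :2] / total              # r = R/(R+G+B), g = G/(R+G+B)
    mean = rg.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(rg, rowvar=False))
    return mean, inv_cov

def skin_mask(image, mean, inv_cov, threshold=6.0):
    """Skin-color filter: keep pixels whose squared Mahalanobis distance
    to the skin mean falls below a threshold (6.0 is an assumed value)."""
    img = image.astype(float)
    total = img.sum(axis=2, keepdims=True) + 1e-8
    rg = img[..., :2] / total
    d = rg - mean
    m2 = np.einsum('...i,ij,...j->...', d, inv_cov, d)
    return m2 < threshold
```

Connected components of the resulting mask would then yield face-region candidates, one per participant.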
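The second step, matching a mouth template inside each face region, can be illustrated with an exhaustive normalized cross-correlation search. This is a minimal sketch, not the authors' implementation: the window sizes, grayscale input, and `top_k` parameter are assumptions.

```python
import numpy as np

def match_template(face, template, top_k=3):
    """Slide a grayscale mouth template over a face region and return the
    top-k candidate positions ranked by normalized cross-correlation."""
    fh, fw = face.shape
    th, tw = template.shape
    t = template.astype(float) - template.mean()
    tn = np.linalg.norm(t)
    scores = []
    for y in range(fh - th + 1):
        for x in range(fw - tw + 1):
            w = face[y:y + th, x:x + tw].astype(float)
            w = w - w.mean()
            denom = np.linalg.norm(w) * tn
            score = float((w * t).sum() / denom) if denom else 0.0
            scores.append((score, (y, x)))
    scores.sort(key=lambda s: -s[0])
    return scores[:top_k]
```

The surviving candidates would then be ranked by their color dissimilarity to the skin model, since lip pixels fit the skin distribution worse than cheek or chin pixels.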
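The final step, flagging mouth movement from the normalized cross-correlation of the mouth area between two successive frames, can be sketched as below. The decision threshold of 0.8 is an assumed value, not one reported in the paper.

```python
import numpy as np

def ncc(patch_a, patch_b):
    """Normalized cross-correlation coefficient between two same-size grayscale patches."""
    a = patch_a.astype(float).ravel()
    b = patch_b.astype(float).ravel()
    a -= a.mean()
    b -= b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0:
        return 1.0  # a flat patch has no texture to compare; treat as unchanged
    return float(np.dot(a, b) / denom)

def mouth_moving(prev_mouth, cur_mouth, threshold=0.8):
    """Flag movement when correlation between successive mouth patches drops
    below the threshold (lip motion decorrelates the two patches)."""
    return ncc(prev_mouth, cur_mouth) < threshold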

[1]  David McNeill Cognitive Science of Gesture. Growth Points, Catchments, and Contexts. , 2000 .

[2]  Gregory J. Wolff,et al.  Neural network lipreading system for improved speech recognition , 1992, [Proceedings 1992] IJCNN International Joint Conference on Neural Networks.

[3]  A. Yuille Deformable Templates for Face Recognition , 1991, Journal of Cognitive Neuroscience.

[4]  Yochai Konig,et al.  "Eigenlips" for robust speech recognition , 1994, Proceedings of ICASSP '94. IEEE International Conference on Acoustics, Speech and Signal Processing.

[5]  Markus Kampmann Automatic 3-D face model adaptation for model-based coding of videophone sequences , 2002, IEEE Trans. Circuits Syst. Video Technol..

[6]  Alexander H. Waibel,et al.  Skin-Color Modeling and Adaptation , 1998, ACCV.

[7]  David Taylor Hearing by Eye: The Psychology of Lip-Reading , 1988 .

[8]  David McNeill,et al.  Growth Points, Catchments, and Contexts , 2000 .

[9]  Rogério Schmidt Feris,et al.  Hierarchical wavelet networks for facial feature localization , 2002, Proceedings of Fifth IEEE International Conference on Automatic Face Gesture Recognition.

[10]  Marius Malciu,et al.  A robust model-based approach for 3D head tracking in video sequences , 2000, Proceedings Fourth IEEE International Conference on Automatic Face and Gesture Recognition (Cat. No. PR00580).

[11]  Francis K. H. Quek,et al.  Meeting room configuration and multiple camera calibration in meeting analysis , 2005, ICMI '05.