Detection of a Speaker in Video by Combined Analysis of Speech Sound and Mouth Movement

We present a robust method for detecting and locating a speaker through joint analysis of the speech sound and the video image. First, a short segment of speech is analyzed to estimate the rate of spoken syllables, and a difference image is formed using the optimal frame distance derived from that rate to detect mouth candidates. The candidates are then tracked to confirm that one of them is the mouth: the rate of mouth movement is estimated from the brightness change profile of the first candidate and, if the two rates agree, the three brightest parts of the resulting difference image are detected as the mouth and eyes. If the rates do not agree, the second candidate is tracked, and so on. The first-order moment of the power spectrum of the brightness change profile and the lateral shifts observed during tracking are also used to check whether the candidates are facial parts.
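
The audio-visual check described above can be illustrated with a minimal sketch. The abstract does not give the formulas, so everything specific here is an assumption: the frame distance is taken as half a syllable period, the mouth-movement rate as the peak of the power spectrum of the brightness change profile, and agreement as a relative tolerance. All function names and the `tol` parameter are hypothetical.

```python
import numpy as np

def frame_distance(syllable_rate_hz, fps):
    """Assumed rule: half a syllable period, in frames (at least 1)."""
    return max(1, int(round(fps / (2.0 * syllable_rate_hz))))

def difference_image(frames, d):
    """Mean absolute brightness difference between frames d apart.

    frames has shape (T, H, W); the result has shape (H, W).
    """
    frames = np.asarray(frames, dtype=float)
    return np.abs(frames[d:] - frames[:-d]).mean(axis=0)

def profile_rate_and_moment(profile, fps):
    """Peak frequency and first-order moment of the power spectrum
    of a brightness change profile sampled at fps frames per second."""
    x = np.asarray(profile, dtype=float)
    x = x - x.mean()                      # remove the DC component
    power = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fps)
    total = power.sum()
    if total == 0.0:                      # flat profile: no movement
        return 0.0, 0.0
    peak = float(freqs[np.argmax(power)])
    moment = float((freqs * power).sum() / total)
    return peak, moment

def rates_agree(audio_rate, visual_rate, tol=0.25):
    """Assumed test: rates agree within a relative tolerance."""
    return abs(audio_rate - visual_rate) <= tol * audio_rate
```

For a syllable rate of 5 Hz and video at 30 frames per second, the assumed half-period rule gives a frame distance of 3 frames. A candidate region whose brightness profile peaks near the audio rate would pass `rates_agree`, and the next candidate would be tried otherwise.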
