Multimodal speaker localization in a probabilistic framework

A multimodal probabilistic framework is proposed for finding the active speaker in a video sequence. We localize the current speaker's mouth in the image by using the video and audio channels jointly. We propose a novel visual feature that is well suited to analyzing mouth movement. After estimating the joint probability density of the audio and visual features, we can find the most probable location of the current speaker's mouth in a sequence of images. The proposed method is tested on the CUAVE audio-visual database and yields improved results compared to other approaches from the literature.
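To make the pipeline described above concrete, here is a minimal sketch of one plausible realization. The abstract does not specify the actual visual feature, audio feature, or density estimator, so the log frame energy, frame-difference motion measure, Gaussian kernel density estimate, and mutual-information score below are illustrative assumptions, not the authors' method.

```python
# Illustrative sketch only: the features and the estimator are stand-in
# assumptions for the joint audio-visual density step in the abstract.
import numpy as np
from scipy.stats import gaussian_kde

def audio_feature(audio_frames):
    """Assumed audio feature: log energy per video-aligned audio frame."""
    return np.log(np.sum(audio_frames.astype(float) ** 2, axis=1) + 1e-12)

def visual_feature(region_frames):
    """Assumed visual feature: mean absolute intensity change between
    consecutive frames inside one candidate mouth region (T, H, W)."""
    diffs = np.abs(np.diff(region_frames.astype(float), axis=0))
    return diffs.mean(axis=(1, 2))

def mutual_information(a, v):
    """Plug-in MI estimate: fit joint and marginal densities with Gaussian
    KDE, then average log p(a,v) - log p(a) - log p(v) over the samples."""
    joint = gaussian_kde(np.vstack([a, v]))
    pa, pv = gaussian_kde(a), gaussian_kde(v)
    eps = 1e-300  # guard against log(0)
    return np.mean(np.log(joint(np.vstack([a, v])) + eps)
                   - np.log(pa(a) + eps)
                   - np.log(pv(v) + eps))

def locate_speaker_mouth(audio_frames, candidate_regions):
    """Return the index of the candidate mouth region whose visual motion
    is most statistically dependent on the audio track."""
    a = audio_feature(audio_frames)
    scores = []
    for region_frames in candidate_regions:
        v = visual_feature(region_frames)      # length T-1
        scores.append(mutual_information(a[1:], v))  # align to T-1 samples
    return int(np.argmax(scores))
```

Scoring each candidate region by a plug-in mutual-information estimate favors regions whose motion is statistically dependent on the audio, which is the intuition behind estimating the joint probability density of the two feature streams.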
