Audio-Visual Clustering for 3D Speaker Localization

We address the problem of localizing individual speakers in a scene where several people are engaged in conversation. We use a human-like sensor configuration (binaural and binocular) to gather both auditory and visual observations. We show that the localization problem can be recast as the task of clustering the audio-visual observations into coherent groups. We propose a probabilistic generative model that captures the relations between audio and visual observations. This model maps the data to a representation of the common 3D scene-space via a pair of Gaussian mixture models. Inference is performed by a version of the Expectation-Maximization algorithm, which provides cooperative estimates of both the activity (speaking or not) and the 3D position of each speaker.
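The clustering step described above can be illustrated with a toy sketch. The paper's actual model ties two mixtures (one per modality) to shared 3D speaker positions; the minimal version below, which is an illustrative assumption rather than the authors' formulation, simply fits a single isotropic Gaussian mixture by EM to points already mapped into a common observation space.

```python
import numpy as np

def em_gmm(X, K, n_iter=100, seed=0):
    """Basic EM for an isotropic Gaussian mixture (toy sketch only).

    E-step: compute responsibilities r[i, k] proportional to
            pi_k * N(x_i | mu_k, var_k * I).
    M-step: re-estimate mixing weights, means, and variances
            from the responsibility-weighted data.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = X[rng.choice(n, K, replace=False)]  # initialize means at data points
    var = np.full(K, X.var())                # one isotropic variance per component
    pi = np.full(K, 1.0 / K)                 # uniform mixing weights
    for _ in range(n_iter):
        # E-step: log-responsibilities, stabilized before exponentiation
        d2 = ((X[:, None, :] - mu[None]) ** 2).sum(-1)
        log_r = np.log(pi) - 0.5 * d * np.log(2 * np.pi * var) - d2 / (2 * var)
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: closed-form updates of weights, means, variances
        Nk = r.sum(axis=0)
        pi = Nk / n
        mu = (r.T @ X) / Nk[:, None]
        d2 = ((X[:, None, :] - mu[None]) ** 2).sum(-1)
        var = (r * d2).sum(axis=0) / (d * Nk)
    return pi, mu, var, r

# Two well-separated "speakers" in a 3D scene space
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([-3.0, 0.0, 0.0], 0.3, (80, 3)),
               rng.normal([3.0, 0.0, 0.0], 0.3, (80, 3))])
pi, mu, var, r = em_gmm(X, K=2)
```

In the full model, each component's mean plays the role of a speaker's 3D position, and the responsibilities play the role of the soft assignment of each audio or visual observation to a speaker.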
