Paper Information - Audio-Visual Clustering for 3D Speaker Localization

We address the problem of localizing individuals in a scene that contains several people engaged in a multiple-speaker conversation. We use a human-like configuration of sensors (binaural and binocular) to gather both auditory and visual observations. We show that the localization problem can be recast as the task of clustering the audio-visual observations into coherent groups. We propose a probabilistic generative model that captures the relations between audio and visual observations. This model maps the data to a representation of the common 3D scene-space, via a pair of Gaussian mixture models. Inference is performed by a version of the Expectation-Maximization (EM) algorithm, which provides cooperative estimates of both the activity (speaking or not) and the 3D position of each speaker.
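To make the clustering idea concrete, the following is a minimal sketch of textbook EM for a single mixture of isotropic Gaussians over 3D points. It is not the paper's model: the abstract describes a pair of coupled mixtures over audio and visual observations, whereas this sketch clusters plain 3D points only. The function name `em_isotropic_gmm`, the toy data, and the initial means are all assumptions made for illustration.

```python
import math


def em_isotropic_gmm(points, init_means, n_iter=50, min_var=1e-6):
    """Fit a mixture of isotropic Gaussians with plain EM (illustrative only)."""
    k = len(init_means)
    dim = len(points[0])
    means = [list(m) for m in init_means]
    variances = [1.0] * k          # one scalar variance per component
    weights = [1.0 / k] * k        # mixing proportions
    resp = []
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each point,
        # computed in the log domain to avoid underflow.
        resp = []
        for x in points:
            log_p = []
            for j in range(k):
                sq = sum((xi - mi) ** 2 for xi, mi in zip(x, means[j]))
                log_p.append(
                    math.log(weights[j])
                    - 0.5 * dim * math.log(2.0 * math.pi * variances[j])
                    - 0.5 * sq / variances[j]
                )
            top = max(log_p)
            unnorm = [math.exp(lp - top) for lp in log_p]
            total = sum(unnorm)
            resp.append([u / total for u in unnorm])
        # M-step: re-estimate weights, means, and variances from the posteriors.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-12:
                continue  # component received no mass; keep its old parameters
            weights[j] = nj / len(points)
            means[j] = [
                sum(r[j] * x[d] for r, x in zip(resp, points)) / nj
                for d in range(dim)
            ]
            sq_sum = sum(
                r[j] * sum((xi - mi) ** 2 for xi, mi in zip(x, means[j]))
                for r, x in zip(resp, points)
            )
            variances[j] = max(sq_sum / (dim * nj), min_var)
    return weights, means, variances, resp


# Toy data (hypothetical): two "speakers" at assumed 3D positions, each
# observed as 8 points placed symmetrically around the true position.
offsets = [(dx, dy, dz) for dx in (-0.05, 0.05)
           for dy in (-0.05, 0.05) for dz in (-0.05, 0.05)]
centers = [(0.0, 0.0, 1.0), (1.0, 0.0, 2.0)]
points = [tuple(c[d] + o[d] for d in range(3)) for c in centers for o in offsets]

init_means = [(0.2, 0.0, 1.2), (0.8, 0.0, 1.8)]
weights, means, variances, resp = em_isotropic_gmm(points, init_means)
print("estimated positions:", means)
```

The responsibilities returned in `resp` play the role of soft cluster assignments: each observation is attributed to a candidate speaker position with some probability, and the component means serve as the position estimates. Estimating speaker activity, as the abstract describes, would need the additional audio model that this sketch leaves out.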

Radu Horaud | Miles E. Hansard | Florence Forbes | Elise Arnaud | Vasil Khalidov
