Paper information - Audio-visual speaker tracking with importance particle filters

Audio-visual speaker tracking with importance particle filters

We present a probabilistic method for audio-visual (AV) speaker tracking, using an uncalibrated wide-angle camera and a microphone array. The algorithm fuses 2-D object shape and audio information via importance particle filters (I-PFs), allowing for the asymmetrical integration of AV information in a way that efficiently exploits the complementary features of each modality. Audio localization information is used to generate an importance sampling (IS) function, which guides the random search process of a particle filter towards regions of the configuration space likely to contain the true configuration (a speaker). The measurement process integrates contour-based and audio observations, which results in reliable head tracking in realistic scenarios. We show that imperfect single modalities can be combined into an algorithm that automatically initializes and tracks a speaker, switches between multiple speakers, tolerates visual clutter, and recovers from total AV object occlusion, in the context of a multimodal meeting room.
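The idea of letting audio guide the particle search can be illustrated with a minimal sketch of one importance particle filter step, in the style of ICONDENSATION (Isard and Blake). This is not the authors' implementation. Everything concrete here is an assumption made for illustration: the 1-D state (horizontal head position in pixels), the random-walk dynamics, the Gaussian importance function centred on a hypothetical audio estimate, the stand-in `visual_likelihood`, and all parameter values. The paper's 2-D contour model, audio observation term, and any reinitialization from a prior are omitted.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Density of N(mean, std^2) evaluated at x (broadcasts over arrays)."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def ipf_step(particles, weights, audio_estimate, visual_likelihood, rng,
             dyn_std=5.0, audio_std=10.0, q_importance=0.5):
    """One filter step mixing dynamics-based and audio-guided sampling.

    particles, weights: previous weighted particle set (weights sum to 1).
    audio_estimate: hypothetical audio localization in pixels, or None if silent.
    visual_likelihood: callable mapping positions to unnormalized likelihoods.
    """
    n = len(particles)
    new_particles = np.empty(n)
    correction = np.ones(n)

    # Decide per particle which sampling branch to use (assumed mixing ratio).
    if audio_estimate is None:
        use_audio = np.zeros(n, dtype=bool)
    else:
        use_audio = rng.random(n) < q_importance

    # Standard branch: resample parents by weight, then apply random-walk dynamics.
    n_dyn = int((~use_audio).sum())
    parents = rng.choice(n, size=n_dyn, p=weights)
    new_particles[~use_audio] = particles[parents] + rng.normal(0.0, dyn_std, n_dyn)

    # Importance branch: draw from g(x) centred on the audio estimate and
    # correct by f(x) / g(x), where f is the prediction density implied by
    # the previous particle set under the same random-walk dynamics.
    n_aud = int(use_audio.sum())
    if n_aud > 0:
        samples = rng.normal(audio_estimate, audio_std, n_aud)
        new_particles[use_audio] = samples
        f = (weights[None, :]
             * gaussian_pdf(samples[:, None], particles[None, :], dyn_std)).sum(axis=1)
        g = gaussian_pdf(samples, audio_estimate, audio_std)
        correction[use_audio] = f / g

    # Measurement update, then normalization (uniform fallback if all weights vanish).
    new_weights = correction * visual_likelihood(new_particles)
    total = new_weights.sum()
    if total <= 0.0 or not np.isfinite(total):
        new_weights = np.full(n, 1.0 / n)
    else:
        new_weights = new_weights / total
    return new_particles, new_weights

# Toy usage: a head near pixel 200, with a slightly biased audio estimate at 205.
rng = np.random.default_rng(0)
particles = rng.uniform(0.0, 640.0, 300)
weights = np.full(300, 1.0 / 300)
likelihood = lambda x: np.exp(-0.5 * ((x - 200.0) / 15.0) ** 2)
for _ in range(15):
    particles, weights = ipf_step(particles, weights, 205.0, likelihood, rng)
estimate = float(np.sum(weights * particles))
```

The correction factor `f / g` is what makes the integration asymmetrical: audio only proposes where to look, while the final weight still depends on the dynamics and the visual measurement, so a misleading audio estimate yields particles with low weight rather than a wrong track.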

Jean-Marc Odobez | Daniel Gatica-Perez | Iain McCowan | Guillaume Lathoud | Darren Moore
