Audio-visual speaker tracking with importance particle filters

We present a probabilistic method for audio-visual (AV) speaker tracking, using an uncalibrated wide-angle camera and a micro- phone array. The algorithm fuses 2-D object shape and audio information via importance particle filters (I-PFs), allowing for the asymmetrical integration of AV information in a way that efficiently exploits the complementary features of each modality. Audio localization information is used to generate an importance sampling (IS) function, which guides the random search process of a particle filter towards regions of the configuration space likely to contain the true configuration (a speaker). The measurement process integrates contour-based and audio observations, which results in reliable head tracking in realistic scenarios. We show that imperfect single modalities can be combined into an algorithm that automatically initializes and tracks a speaker, switches between multiple speakers, tolerates visual clutter, and recovers from total AV object occlusion, in the context of a multimodal meeting room.

[1]  Larry S. Davis,et al.  Multimodal 3-D tracking and event detection via the particle filter , 2001, Proceedings IEEE Workshop on Detection and Recognition of Events in Video.

[2]  Vladimir Pavlovic,et al.  Multimodal speaker detection using error feedback dynamic Bayesian networks , 2000, Proceedings IEEE Conference on Computer Vision and Pattern Recognition. CVPR 2000 (Cat. No.PR00662).

[3]  David B. Dunson,et al.  Bayesian Data Analysis , 2010 .

[4]  Michael Isard,et al.  Active Contours , 2000, Springer London.

[5]  Michael Isard,et al.  ICONDENSATION: Unifying Low-Level and High-Level Tracking in a Stochastic Framework , 1998, ECCV.

[6]  G. Carter,et al.  The generalized correlation method for estimation of time delay , 1976 .

[7]  Nando de Freitas,et al.  Sequential Monte Carlo Methods in Practice , 2001, Statistics for Engineering and Information Science.

[8]  Michael S. Brandstein,et al.  A robust method for speech signal time-delay estimation in reverberant rooms , 1997, 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[9]  Nebojsa Jojic,et al.  Audio-Video Sensor Fusion with Probabilistic Graphical Models , 2002, ECCV.

[10]  A. Blake,et al.  Sequential Monte Carlo fusion of sound and vision for speaker tracking , 2001, Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001.