Bayesian integration of audio and visual information for multi-target tracking using a CB-MeMBer filter

A new method is presented for integrating audio and visual information in multi-target tracking applications. The proposed approach uses a Bayesian filtering formulation and exploits multi-Bernoulli random finite set approximations. The work presented in this paper is the first principled Bayesian estimation approach to sensor fusion problems that involve intermittent sensory data (e.g., audio data from a person who speaks only occasionally). We evaluated our method on case studies from the SPEVI database. The results show nearly perfect tracking of people both when they are silent and when they are speaking but not visible to the camera.
