An Audio-Visual Particle Filter for Speaker Tracking on the CLEAR'06 Evaluation Dataset

We present an approach for tracking a lecturer throughout a talk. Features from multiple cameras and microphones are processed in a joint particle filter framework. The filter samples 3D location hypotheses, projects them, and scores them using features from both audio and video. On the video side, the features are based on foreground segmentation, multi-view face detection, and upper body detection. On the audio side, the time delays of arrival between pairs of microphones are estimated with a generalized cross-correlation function. In the CLEAR'06 evaluation, the system achieved a multiple object tracking accuracy (MOTA) of 71% for video-only, 55% for audio-only, and 90% for combined audio-visual tracking.
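
As an illustration of the audio side, the following is a minimal sketch of time-delay estimation between one microphone pair using generalized cross-correlation. It assumes the PHAT weighting, which the abstract does not specify, and the function names `gcc_phat_delay` and `expected_tdoa`, the speed of sound of 343 m/s, and all parameters are hypothetical choices for this example rather than the system's actual implementation. The second function shows one plausible way a 3D location hypothesis could be compared against a measured delay.

```python
import numpy as np


def gcc_phat_delay(sig, ref, fs, max_tau=None):
    """Estimate the delay (seconds) of `sig` relative to `ref`.

    Uses generalized cross-correlation with PHAT weighting (an assumption
    of this sketch). A positive result means `sig` lags `ref`.
    """
    n = len(sig) + len(ref)  # zero-pad to avoid circular wrap-around
    sig_f = np.fft.rfft(sig, n=n)
    ref_f = np.fft.rfft(ref, n=n)

    # Cross-power spectrum, normalised to unit magnitude (PHAT weighting).
    cross = sig_f * np.conj(ref_f)
    cross /= np.abs(cross) + 1e-12
    cc = np.fft.irfft(cross, n=n)

    # Restrict the search to physically plausible lags if requested.
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)

    # Reorder so that index i corresponds to lag (i - max_shift).
    cc = np.roll(cc, max_shift)[: 2 * max_shift + 1]
    shift = int(np.argmax(np.abs(cc))) - max_shift
    return shift / fs


def expected_tdoa(position, mic_a, mic_b, speed_of_sound=343.0):
    """Delay (seconds) at mic_a relative to mic_b for a source at `position`."""
    position = np.asarray(position, dtype=float)
    dist_a = np.linalg.norm(position - np.asarray(mic_a, dtype=float))
    dist_b = np.linalg.norm(position - np.asarray(mic_b, dtype=float))
    return (dist_a - dist_b) / speed_of_sound
```

Under these assumptions, a hypothesis could be scored by how closely `expected_tdoa` for its 3D position matches the delay returned by `gcc_phat_delay` for the same microphone pair; how the evaluated system actually combines these quantities is not stated in the abstract.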
