A joint particle filter for audio-visual speaker tracking

In this paper, we present a novel approach for tracking a lecturer throughout a lecture. We use features from multiple cameras and microphones and process them in a joint particle filter framework. The filter performs sampled projections of 3D location hypotheses and scores them using features from both audio and video. On the video side, the features are based on foreground segmentation, multi-view face detection, and upper-body detection. On the audio side, the time delays of arrival between pairs of microphones are estimated with a generalized cross-correlation function. Computationally expensive features are evaluated only at the particles' projected positions in the respective camera images, which keeps the complexity of the proposed algorithm low. We evaluated the system on data recorded during actual lectures. In our experiments, the average tracking error was 36 cm for video-only tracking, 46 cm for audio-only tracking, and 31 cm for the combined audio-video system.
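The scoring scheme described above can be illustrated with a minimal sketch of one filter update on synthetic data. This is not the system's implementation. The Gaussian scores stand in for the actual video features (foreground segmentation, face and upper-body detection) and for the generalized cross-correlation values, and the camera matrices, microphone positions, noise parameters, and all function names (`project`, `tdoa`, `filter_step`, `run_demo`) are assumptions made for illustration.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value


def project(cam, pts):
    """Project 3D points (N, 3) to pixel coordinates (N, 2) with a 3x4 camera matrix."""
    homog = np.hstack([pts, np.ones((len(pts), 1))])
    img = homog @ cam.T
    return img[:, :2] / img[:, 2:3]


def tdoa(pts, mic_a, mic_b):
    """Expected time delay of arrival (seconds) between two microphones."""
    d_a = np.linalg.norm(pts - mic_a, axis=1)
    d_b = np.linalg.norm(pts - mic_b, axis=1)
    return (d_a - d_b) / SPEED_OF_SOUND


def filter_step(particles, weights, cams, detections, mic_pairs, delays, rng,
                motion_std=0.05, pixel_std=40.0, delay_std=2e-4):
    """One resample / diffuse / score cycle over 3D location hypotheses."""
    n = len(particles)
    # Resample by the previous weights, then apply a random-walk motion model.
    idx = rng.choice(n, size=n, p=weights)
    particles = particles[idx] + rng.normal(0.0, motion_std, (n, 3))

    log_w = np.zeros(n)
    # Video score: evaluated only at each particle's projected image position.
    # (Stand-in: Gaussian distance to a single detection per camera.)
    for cam, det in zip(cams, detections):
        err = np.linalg.norm(project(cam, particles) - det, axis=1)
        log_w += -0.5 * (err / pixel_std) ** 2
    # Audio score: compare each particle's expected delay with the measured one.
    for (mic_a, mic_b), tau in zip(mic_pairs, delays):
        err = tdoa(particles, mic_a, mic_b) - tau
        log_w += -0.5 * (err / delay_std) ** 2

    # Normalize in the log domain to avoid underflow.
    weights = np.exp(log_w - log_w.max())
    weights /= weights.sum()
    return particles, weights, weights @ particles


def run_demo(seed=0, steps=30, n=2000):
    """Track a static synthetic speaker; returns (position error in m, weights)."""
    rng = np.random.default_rng(seed)
    intrinsics = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
    # Two hypothetical cameras at x = -1 m and x = +1 m, both looking along +z.
    cams = [intrinsics @ np.hstack([np.eye(3), np.array([[t], [0.0], [0.0]])])
            for t in (1.0, -1.0)]
    mics = np.array([[-2.0, 0.0, 0.0], [2.0, 0.0, 0.0],
                     [0.0, -1.0, 0.0], [0.0, 1.0, 5.0]])
    mic_pairs = [(mics[0], mics[1]), (mics[2], mics[3])]
    truth = np.array([[0.8, 0.3, 3.0]])
    # Noise-free synthetic observations generated from the true position.
    detections = [project(cam, truth)[0] for cam in cams]
    delays = [tdoa(truth, a, b)[0] for a, b in mic_pairs]

    particles = rng.uniform([-2.0, -1.0, 1.0], [2.0, 1.0, 5.0], (n, 3))
    weights = np.full(n, 1.0 / n)
    estimate = particles.mean(axis=0)
    for _ in range(steps):
        particles, weights, estimate = filter_step(
            particles, weights, cams, detections, mic_pairs, delays, rng)
    return np.linalg.norm(estimate - truth[0]), weights
```

Calling `run_demo()` returns the distance between the weighted-mean estimate and the synthetic ground truth, together with the final normalized weights. Because each cue only needs a score at the particles' own positions, the per-frame cost grows with the number of particles rather than with image size, which is the property the abstract attributes to the approach.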
