Multimodal Tracking for Smart Videoconferencing and Video Surveillance

Many applications require tracking the 3-D motion of subjects. We build a particle-filter-based framework for multimodal tracking using multiple cameras and multiple microphone arrays. To calibrate the resulting system, we propose a method that determines the locations of all microphones using at least five loudspeakers, under the assumption that for each loudspeaker there is a microphone very close to it. We derive the maximum likelihood (ML) estimator, which reduces to solving a non-linear least-squares problem. We verify the correctness and robustness of the multimodal tracker and of the self-calibration algorithm both with Monte Carlo simulations and on real data from three experimental setups.
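
As a rough illustration of how an ML position estimate can reduce to non-linear least squares, the sketch below solves a deliberately simplified version of the calibration problem: a single microphone is located from range measurements to loudspeakers whose positions are assumed known, with independent Gaussian range noise (under which ML and least squares coincide). This is not the estimator derived in the paper; the function name `locate_microphone`, the Gauss-Newton solver, and the room geometry are all assumptions made for the example.

```python
import numpy as np

def locate_microphone(speakers, ranges, x0, iters=50, tol=1e-10):
    """Gauss-Newton solve of min_x sum_i (||x - s_i|| - r_i)^2.

    speakers: (N, 3) loudspeaker positions (assumed known here)
    ranges:   (N,) measured speaker-to-microphone distances
    x0:       (3,) initial guess for the microphone position
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        diff = x - speakers                    # vectors from speakers to x
        dist = np.linalg.norm(diff, axis=1)    # predicted ranges
        residual = dist - ranges
        jacobian = diff / dist[:, None]        # d(dist_i)/dx, one row per speaker
        # Linearized least-squares step: jacobian @ step ~= -residual
        step, *_ = np.linalg.lstsq(jacobian, -residual, rcond=None)
        x = x + step
        if np.linalg.norm(step) < tol:
            break
    return x

# Hypothetical setup: five non-coplanar loudspeakers in a small room (meters).
speakers = np.array([
    [0.0, 0.0, 0.0],
    [4.0, 0.0, 0.0],
    [0.0, 3.0, 0.0],
    [0.0, 0.0, 2.5],
    [4.0, 3.0, 2.5],
])
true_mic = np.array([1.0, 2.0, 1.2])

# Noise-free ranges; in practice these would come from times of flight
# multiplied by an assumed speed of sound.
ranges = np.linalg.norm(speakers - true_mic, axis=1)

# Start from the centroid of the loudspeakers.
estimate = locate_microphone(speakers, ranges, speakers.mean(axis=0))
```

Gauss-Newton is used here only because it is compact; any non-linear least-squares solver would serve, and with noisy ranges the result depends on the initial guess.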
