Paper information - Multimodal Tracking for Smart Videoconferencing and Video Surveillance

Many applications require the ability to track the 3-D motion of subjects. We build a particle-filter-based framework for multimodal tracking using multiple cameras and multiple microphone arrays. To calibrate the resulting system, we propose a method that determines the locations of all microphones using at least five loudspeakers, under the assumption that for each loudspeaker there is a microphone very close to it. We derive the maximum likelihood (ML) estimator, which reduces to the solution of a nonlinear least-squares problem. We verify the correctness and robustness of the multimodal tracker and of the self-calibration algorithm both with Monte Carlo simulations and on real data from three experimental setups.
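
The reduction from an ML estimator to nonlinear least squares can be illustrated with a generic time-of-flight model. This is only a sketch under assumed notation, not the paper's exact measurement model: the symbols m_i (microphone positions), s_j (loudspeaker positions), c (speed of sound), t̂_ij (measured times of flight), and the i.i.d. Gaussian noise n_ij are all assumptions introduced here for illustration.

```latex
% Illustrative assumptions (not from the source): microphone positions m_i,
% loudspeaker positions s_j, speed of sound c, measured times of flight \hat{t}_{ij},
% and i.i.d. zero-mean Gaussian noise n_{ij} with variance \sigma^2.
\begin{align}
\hat{t}_{ij} &= \frac{\lVert m_i - s_j \rVert}{c} + n_{ij},
  \qquad n_{ij} \sim \mathcal{N}(0, \sigma^2), \\
-\log p(\hat{t} \mid m, s) &= \frac{1}{2\sigma^2} \sum_{i,j}
  \left( \hat{t}_{ij} - \frac{\lVert m_i - s_j \rVert}{c} \right)^{2} + \text{const}, \\
(\hat{m}, \hat{s})_{\mathrm{ML}} &= \arg\min_{m,\, s} \sum_{i,j}
  \left( \hat{t}_{ij} - \frac{\lVert m_i - s_j \rVert}{c} \right)^{2} .
\end{align}
```

Under these assumptions, maximizing the likelihood is the same as minimizing the sum of squared residuals. Because the Euclidean norm makes each residual nonlinear in the unknown positions, the minimization generally has to be carried out with an iterative solver.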
