Multimedia sensor fusion for intelligent camera control

A multisensor-based control system for an active pan/tilt/zoom camera is presented. Acoustic and visual information from multimedia sensors is used to locate the person currently speaking and to track people moving about in a room. Pixel-level fusion of skin color with an image produced from interaural sound delay provides a simple means of detecting the face of the current speaker. For wider-scale surveillance tasks, moving targets are detected using color image differencing. Target data is fed to a behavior-based fuzzy control system, which uses expert rules to aim the camera. Applications include video-conferencing, security, surveillance, and human-computer interaction. The system has been implemented on a multimedia PC equipped with a wide-angle camera, a Canon VC-CI pan/tilt/zoom camera, and two microphones.
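As a rough illustration of the pixel-level fusion idea, the sketch below multiplies a skin-color map by an image derived from the delay between two microphone signals and takes the centroid of the result. It is a minimal sketch under stated assumptions, not the implementation described here: the function names, the chromaticity thresholds, the plain cross-correlation delay estimator, the linear mapping from delay to image column, and the Gaussian width `sigma` are all hypothetical choices made for the example.

```python
import numpy as np

def estimate_delay(left, right, max_lag):
    """Lag (in samples) of `right` relative to `left` that maximizes
    the plain cross-correlation. Assumed estimator, chosen for brevity."""
    lags = np.arange(-max_lag, max_lag + 1)
    core = slice(max_lag, len(left) - max_lag)  # trim to avoid wrap-around
    scores = [np.dot(left[core], np.roll(right, -lag)[core]) for lag in lags]
    return int(lags[int(np.argmax(scores))])

def sound_image(delay, max_lag, shape, sigma=8.0):
    """Image whose columns are weighted by closeness to the column implied
    by the delay. The linear delay-to-column mapping is an assumption."""
    height, width = shape
    center = (delay + max_lag) / (2 * max_lag) * (width - 1)
    profile = np.exp(-0.5 * ((np.arange(width) - center) / sigma) ** 2)
    return np.tile(profile, (height, 1))

def skin_likelihood(rgb):
    """Binary skin map from normalized r,g chromaticity.
    The thresholds are illustrative, not taken from the paper."""
    rgb = rgb.astype(float)
    total = rgb.sum(axis=2) + 1e-6
    r = rgb[..., 0] / total
    g = rgb[..., 1] / total
    return ((r > 0.36) & (r < 0.55) & (g > 0.26) & (g < 0.36)).astype(float)

def locate_speaker(rgb, left, right, max_lag=20):
    """Fuse the two maps pixel by pixel and return the (x, y) centroid,
    or None when no pixel is supported by both cues."""
    delay = estimate_delay(left, right, max_lag)
    fused = skin_likelihood(rgb) * sound_image(delay, max_lag, rgb.shape[:2])
    weight = fused.sum()
    if weight == 0:
        return None
    ys, xs = np.indices(fused.shape)
    return float((xs * fused).sum() / weight), float((ys * fused).sum() / weight)
```

Multiplying the two maps keeps only pixels that are both skin-colored and consistent with the sound direction, so a silent face elsewhere in the frame contributes almost nothing to the centroid.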
