Semantic saliency driven camera control for personal remote collaboration

This paper presents a camera combo system for personal remote collaboration applications. The system consists of two cameras: one has a wide field of view, and the other is a pan/tilt/zoom (PTZ) camera that is steered according to analysis of the images captured by the wide-angle camera. Unlike traditional approaches, which usually drive the PTZ camera to follow a person or the person's head, our system can capture general objects of interest in remote collaboration. For instance, when the user holds up an object to show it to the remote person, our system automatically positions the PTZ camera to zoom in on the object. At the core of our system is a semantic saliency map that overcomes many limitations of low-level saliency maps computed from preliminary image features. We demonstrate how such a semantic saliency map can be computed through contextual analysis, sign analysis and transitional analysis, and how it can be used for PTZ camera control with a novel virtual director based on information loss optimization. The effectiveness of the proposed method is demonstrated on real-world sequences.
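The abstract does not spell out the information loss that the virtual director optimizes, so the following is only a minimal sketch of the general idea under stated assumptions. It assumes a saliency map is already available as a 2-D grid from the wide-angle view, and it scores each candidate PTZ window with a hypothetical loss: saliency left outside the window (cropping loss) plus a penalty on saliency inside the window that grows as the window gets wider and the zoom gets lower (resolution loss). The function names, the loss form, and the `detail_weight` parameter are illustrative assumptions, not the paper's formulation.

```python
def integral_image(sal):
    """Summed-area table with one extra row and column of zeros."""
    h, w = len(sal), len(sal[0])
    ii = [[0.0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0.0
        for x in range(w):
            row_sum += sal[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def window_sum(ii, x, y, w, h):
    """Total saliency inside the window with top-left (x, y) and size w x h."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


def best_window(sal, sizes, detail_weight=0.5):
    """Return (loss, x, y, w, h) for the candidate window with the lowest loss.

    Hypothetical loss (an assumption, not taken from the paper):
      cropping loss   = saliency that falls outside the window
      resolution loss = detail_weight * (window area / frame area)
                        * saliency inside the window
    """
    frame_h, frame_w = len(sal), len(sal[0])
    ii = integral_image(sal)
    total = window_sum(ii, 0, 0, frame_w, frame_h)
    best = None
    for w, h in sizes:
        # A wider window means less zoom, hence less detail on its contents.
        area_ratio = (w * h) / (frame_w * frame_h)
        for y in range(frame_h - h + 1):
            for x in range(frame_w - w + 1):
                inside = window_sum(ii, x, y, w, h)
                loss = (total - inside) + detail_weight * area_ratio * inside
                candidate = (loss, x, y, w, h)
                if best is None or candidate < best:
                    best = candidate
    return best


if __name__ == "__main__":
    # Toy 6 x 8 saliency grid with one small salient object.
    sal = [[0.0] * 8 for _ in range(6)]
    for y in (2, 3):
        for x in (4, 5):
            sal[y][x] = 1.0
    print(best_window(sal, [(2, 2), (4, 4), (8, 6)]))
```

In this toy setting the tightest window that fully covers the salient object wins, because any wider window pays a resolution penalty and any misplaced window pays a cropping penalty. A real system would additionally need to map the chosen window to pan, tilt and zoom commands through camera calibration, which is outside the scope of this sketch.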
