Dynamic context capture and distributed video arrays for intelligent spaces

Intelligent environments can be viewed as systems in which humans and machines (rooms) collaborate. Intelligent (or smart) environments need to extract and maintain an awareness of a wide range of events and human activities occurring in these spaces. This requirement is crucial for supporting efficient and effective interactions among humans, as well as between humans and intelligent spaces. Visual information plays an important role in developing an accurate and useful representation of the static and dynamic states of an intelligent environment. Accurate and efficient capture, analysis, and summarization of the dynamic context require the vision system to work robustly at multiple levels of semantic abstraction. In this paper, we present details of a long-term, ongoing research project in which indoor intelligent spaces endowed with a range of useful functionalities are designed, built, and systematically evaluated. Key functionalities include intruder detection; multiple-person tracking; body pose and posture analysis; person identification; human body modeling and movement analysis; and integrated systems for intelligent meeting rooms, teleconferencing, or performance spaces. The paper presents an overall system architecture to support the design and development of intelligent environments. Details of panoramic (omnidirectional) video camera arrays, calibration, video stream synchronization, and real-time capture and processing are discussed. Modules for multicamera-based multiperson tracking, event detection and event-based servoing for selective attention, voxelization, and streaming face recognition are also discussed. The paper includes experimental studies that systematically evaluate the performance of individual video analysis modules, as well as the basic feasibility of an integrated system for dynamic context capture, event-based servoing, and semantic information summarization.
