Audio-visual multi-person tracking and identification for smart environments

This paper presents a novel system for the automatic and unobtrusive tracking and identification of multiple persons in an indoor environment. Information from several fixed cameras is fused in a particle filter framework to track multiple occupants simultaneously. A set of steerable, fuzzy-controlled pan-tilt-zoom cameras smoothly follows persons of interest and opportunistically captures facial close-ups for face identification. In parallel, speech segmentation, sound source localization and speaker identification are performed using several far-field microphones and microphone arrays. Information arrives asynchronously and sporadically from these sources, in the form of track updates and spatio-temporally localized visual and acoustic identification cues. It is fused at a higher level to gradually refine the global scene model and increase the system's confidence in the set of recognized identities. The system was trained on a small set of users' faces and/or voices. In natural meeting scenarios it performed well, quickly acquiring the users' identities and filling in identity information that was missing from a single modality.
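The gradual refinement of identity confidence described above can be illustrated with a minimal sketch. It is not the paper's fusion method, which the abstract does not specify. It assumes a naive Bayesian update in which each sporadic face or voice cue supplies per-user scores that are multiplied into a running belief for one track. The function `fuse_cue`, the user names, the score values and the `floor` parameter are all hypothetical.

```python
def fuse_cue(belief, cue_scores, floor=1e-3):
    """Fold one identification cue into the running belief for a track.

    belief     -- dict mapping user name to current probability
    cue_scores -- dict mapping user name to the classifier score for this cue
    floor      -- assumed lower bound so one bad cue cannot zero out a user
    """
    # Multiply prior belief by the (floored) cue score for every known user.
    unnormalized = {
        name: prob * max(cue_scores.get(name, 0.0), floor)
        for name, prob in belief.items()
    }
    # Renormalize so the belief remains a probability distribution.
    total = sum(unnormalized.values())
    return {name: value / total for name, value in unnormalized.items()}


# Hypothetical enrolled users, starting from a uniform belief.
users = ["alice", "bob", "carol"]
belief = {name: 1.0 / len(users) for name in users}

# Cues arrive sporadically from different modalities (made-up scores).
cues = [
    ("face", {"alice": 0.6, "bob": 0.3, "carol": 0.1}),
    ("voice", {"alice": 0.5, "bob": 0.4, "carol": 0.1}),
    ("face", {"alice": 0.7, "bob": 0.2, "carol": 0.1}),
]

for modality, scores in cues:
    belief = fuse_cue(belief, scores)

# The most probable identity for this track after all cues.
best = max(belief, key=belief.get)
```

Under these assumptions each cue alone is ambiguous, but the belief in one identity grows as cues from both modalities accumulate. A complete system would also need to associate each cue with the correct track before updating.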
