Joint face and head tracking inside multi-camera smart rooms

This paper introduces a detection and tracking system that provides human location information in both frame-view and world coordinates, using video from multiple synchronized, calibrated cameras with overlapping fields of view. The system is developed and evaluated for a specific scenario: a seminar lecturer presenting in front of an audience inside a “smart room”. Its aim is to track the lecturer’s head centroid in three-dimensional (3D) space and to yield two-dimensional (2D) face information in the available camera views. The approach relies primarily on a statistical appearance model of human faces, implemented with well-known AdaBoost-like face detectors that are extended to handle the head pose variation observed in this scenario. The appearance module is complemented by two novel components and assisted by a simple mechanism for detecting tracking drift. The first component is the initialization module, which uses spatio-temporal dynamic programming with appropriate penalty functions to obtain optimal 3D location hypotheses. The second is a 2D tracking scheme based on adaptive subspace learning, with a novel forgetting mechanism introduced to reduce tracking drift and increase robustness. System performance is benchmarked on an extensive database of realistic human interaction in the lecture smart room scenario, collected as part of the European integrated project “CHIL”. The system consistently achieves a 3D mean tracking error of less than 16 cm and is shown to outperform four alternative tracking schemes. It also performs relatively well in detecting frontal and near-frontal faces in the available frame views.
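
To make the idea of spatio-temporal dynamic programming over 3D location hypotheses more concrete, the following is a minimal illustrative sketch and not the paper's implementation. It assumes each frame supplies a few candidate 3D head positions, each with a detection cost, and that the penalty between consecutive frames is the squared Euclidean distance scaled by a weight. The function name `best_trajectory`, the parameter `motion_weight`, and the specific cost terms are all hypothetical, since the abstract does not state the actual penalty functions.

```python
import math


def best_trajectory(candidates, unary_costs, motion_weight=1.0):
    """Choose one 3D candidate per frame with minimum total cost.

    candidates[t]  : non-empty list of (x, y, z) hypotheses for frame t
    unary_costs[t] : per-hypothesis costs for frame t (lower is better)
    motion_weight  : assumed weight of the squared-distance motion penalty
    Returns (list of chosen 3D points, total cost of that path).
    """
    n_frames = len(candidates)
    # cost[t][j]: cheapest total cost of any path ending at candidate j of frame t
    cost = [list(unary_costs[0])]
    # back[t][j]: index of the best predecessor in frame t - 1
    back = [[-1] * len(candidates[0])]

    for t in range(1, n_frames):
        row_cost, row_back = [], []
        for j, p in enumerate(candidates[t]):
            best_i, best_c = 0, math.inf
            for i, q in enumerate(candidates[t - 1]):
                # Assumed transition penalty: large jumps between frames cost more.
                c = cost[t - 1][i] + motion_weight * math.dist(p, q) ** 2
                if c < best_c:
                    best_i, best_c = i, c
            row_cost.append(best_c + unary_costs[t][j])
            row_back.append(best_i)
        cost.append(row_cost)
        back.append(row_back)

    # Backtrack from the cheapest final candidate.
    j = min(range(len(cost[-1])), key=cost[-1].__getitem__)
    total = cost[-1][j]
    path = [j]
    for t in range(n_frames - 1, 0, -1):
        j = back[t][j]
        path.append(j)
    path.reverse()
    return [candidates[t][k] for t, k in enumerate(path)], total
```

In this sketch, a spurious hypothesis with a low detection cost in a single frame is rejected when reaching it would require an implausibly large jump, which is the kind of temporal consistency that penalty functions are meant to enforce.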
