Contextual Person Identification in Multimedia Data

Automatic tracking and identification of faces and persons are essential tasks in many video analysis systems, for example to generate metadata automatically or as a basis for higher-level applications. In many cases, identification is based on a single modality such as faces. In this work, we propose methods to improve person identification by integrating multiple cues, including multiple modalities and contextual information. We motivate and evaluate our proposed methods in the context of multimedia data, specifically TV series. Despite its usually high resolution, multimedia data presents many challenges. For example, camera views change constantly at shot boundaries, the camera position is generally unknown, and image conditions and the poses of faces can change rapidly due to the underlying plot. Since we make only a few assumptions about the underlying data, our methods are applicable to other domains as well, for example in the area of safety and security.

Before we can identify a face, it first has to be localized in the image. In videos, we can further link localizations over time into face tracks. Face tracks can then be identified jointly, and errors in single frames (e.g., due to noise in the data or imprecise localization) can be mitigated, improving overall identification accuracy. In this work, we propose a detector-based face tracking approach built on a large bank of detectors that cover a range of head poses. We integrate the detectors into a particle filter such that they can be used efficiently, i.e., only one detector out of 49 is evaluated for each particle. We evaluate our approach on a data set of two TV series, which we annotated with ground truth face positions. The data set contains over 100 000 annotated faces and is one of the largest public data sets available for the evaluation of face tracking. With our proposed tracking approach we achieve an improvement of 0.15 in Multiple Object
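The idea of evaluating only one pose-specific detector per particle can be illustrated with a minimal sketch. This is not the implementation from the work itself: it assumes, purely for illustration, that the 49 detectors are arranged as a 7 x 7 grid of yaw and pitch bins, that the head-pose bin is part of each particle's state, and that a toy scoring function stands in for a trained detector. The names `make_detector`, `filter_step`, and `demo` are hypothetical.

```python
import math
import random

N_YAW, N_PITCH = 7, 7  # assumption: a 7 x 7 pose grid, giving 49 detectors


def make_detector(pose):
    """Toy stand-in for a detector trained for a single head-pose bin."""
    def score(frame, x, y):
        # A toy "frame" is just the true face position and true pose bin.
        fx, fy, fpose = frame
        dist2 = (x - fx) ** 2 + (y - fy) ** 2
        pose_gap = abs(pose[0] - fpose[0]) + abs(pose[1] - fpose[1])
        return math.exp(-dist2 / 200.0 - pose_gap)
    return score


# The detector bank, indexed by pose bin.
DETECTORS = {(i, j): make_detector((i, j))
             for i in range(N_YAW) for j in range(N_PITCH)}


def clamp(v, lo, hi):
    return max(lo, min(hi, v))


def filter_step(particles, frame, rng):
    """One predict / weight / resample step.

    Returns the resampled particles and the number of detector evaluations.
    """
    # Predict: diffuse position and let the pose bin drift to a neighbor.
    predicted = []
    for x, y, (i, j) in particles:
        predicted.append((
            x + rng.gauss(0.0, 3.0),
            y + rng.gauss(0.0, 3.0),
            (clamp(i + rng.choice((-1, 0, 1)), 0, N_YAW - 1),
             clamp(j + rng.choice((-1, 0, 1)), 0, N_PITCH - 1)),
        ))

    # Weight: each particle queries only the detector of its own pose bin.
    weights = []
    detector_calls = 0
    for x, y, pose in predicted:
        weights.append(DETECTORS[pose](frame, x, y) + 1e-12)
        detector_calls += 1

    # Resample in proportion to the weights.
    resampled = rng.choices(predicted, weights=weights, k=len(predicted))
    return resampled, detector_calls


def demo(n_particles=200, n_steps=10, seed=0):
    rng = random.Random(seed)
    frame = (50.0, 40.0, (3, 3))  # static toy target
    particles = [(rng.uniform(0, 100), rng.uniform(0, 80),
                  (rng.randrange(N_YAW), rng.randrange(N_PITCH)))
                 for _ in range(n_particles)]
    total_calls = 0
    for _ in range(n_steps):
        particles, calls = filter_step(particles, frame, rng)
        total_calls += calls
    return particles, total_calls
```

In this sketch the cost per step is one detector evaluation per particle, regardless of the size of the bank; evaluating every detector at every particle would instead cost 49 times as much.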
