Exploiting scene maps and spatial relationships in quasi-static scenes for video face clustering

Video face clustering is a fundamental step in automatically annotating a video in terms of when and where (i.e., in which video shot and where in a video frame) a given person is visible. State-of-the-art face clustering solutions typically rely on information derived from the visual appearance of the face images. This is challenging because visual appearance varies strongly with factors such as scale, viewpoint, head pose and facial expression. As a result, either the generated face clusters are not sufficiently pure, or their number is much higher than the number of people appearing in the video. A possible way to improve clustering performance is to analyze the visual appearance of faces in specific contexts and to take the contextual information into account when designing the clustering algorithm. In this paper, we focus on the context of quasi-static scenes, in which we can assume that people's positions in a scene are (quasi-)stationary. We present a novel video face clustering algorithm that exploits this property to match faces and efficiently propagate face labels across the range of viewpoints, scales and zoom levels that characterize different frames and shots of a video. We also present a novel, publicly available dataset of manually annotated quasi-static scene videos. Experimental assessment on this dataset indicates that exploiting information derived from the scene and from the spatial relationships between people can substantially improve clustering performance compared to the state of the art in the field.

We propose a video face clustering algorithm for quasi-static scene (QSS) videos. QSS videos include, e.g., talk shows, TV debates, TV games and symphonic concerts. A map of the scene and the spatial relationships between people are inferred. Using them, we match faces in crowded shots that lack visual detail. We show that spatial information is the missing piece for effective face clustering.
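The idea that (quasi-)stationary positions can carry face labels from one shot to another can be illustrated with a small sketch. It is not the algorithm of the paper: it assumes that face positions from two shots have already been projected onto a common scene map (how that map is built is not covered here), and it simply picks the one-to-one assignment with the smallest total distance by brute force. The function name `propagate_labels`, the coordinates and the labels are hypothetical and chosen only for illustration.

```python
from itertools import permutations
from math import dist


def propagate_labels(labeled, unlabeled):
    """Assign known labels to unlabeled faces by position on a shared scene map.

    labeled:   dict mapping a label to an (x, y) position (hypothetical map units)
    unlabeled: list of (x, y) positions of faces seen in another shot
    Returns one label per unlabeled face, minimizing the total distance.
    Brute force over permutations, so only suitable for a handful of faces.
    """
    labels = list(labeled)
    if len(unlabeled) > len(labels):
        raise ValueError("more unlabeled faces than known labels")

    best_cost, best = float("inf"), ()
    # Try every way of giving distinct labels to the unlabeled faces.
    for perm in permutations(labels, len(unlabeled)):
        cost = sum(dist(labeled[lab], pos) for lab, pos in zip(perm, unlabeled))
        if cost < best_cost:
            best_cost, best = cost, perm
    return list(best)


# Illustrative data: labels known from a close-up shot, faces from a wide shot.
close_up = {"host": (0.0, 0.0), "guest_a": (2.0, 0.1), "guest_b": (4.0, 0.0)}
wide_shot = [(3.9, 0.2), (0.1, -0.1), (2.1, 0.0)]
print(propagate_labels(close_up, wide_shot))
```

For more than a few faces, a polynomial-time assignment solver would be the natural replacement for the permutation loop; the brute-force version is used here only to keep the sketch dependency-free.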
