An Online Algorithm for Constrained Face Clustering in Videos

We address the problem of face clustering in long, real-world videos. This is a challenging task because faces in such videos vary widely in scale, pose, illumination, and expression, and may also be partially occluded. Most existing face clustering algorithms are offline, i.e., they assume that the entire dataset is available at once. However, in many practical scenarios, the complete data may not be available at one time, may be too large to process, or may exhibit significant variation in its distribution over time. We propose an online clustering algorithm that processes data sequentially in short segments of variable length. The faces detected in each segment are either assigned to an existing cluster or used to create a new one. Our algorithm combines several spatiotemporal constraints with a convolutional neural network (CNN) that provides a robust face representation, and achieves high clustering accuracy on two benchmark video databases (82.1% and 93.8%). Despite being an online method (a class usually known to have lower accuracy), our algorithm achieves results comparable to or better than state-of-the-art offline and online methods.
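To make the assign-or-create step concrete, the following is a minimal sketch of an online clusterer, not the authors' implementation. The abstract does not specify the assignment rule, the distance measure, or the exact constraints, so everything here is an assumption for illustration: the class name `OnlineFaceClusterer`, the cosine-distance `threshold`, the running-mean centroids, and the single cannot-link constraint (two faces detected in the same frame are never placed in the same cluster). Face embeddings are assumed to be precomputed, nonzero vectors, for example from a CNN.

```python
import numpy as np


class OnlineFaceClusterer:
    """Hypothetical sketch: running-mean centroids with a distance threshold."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold  # assumed cutoff on cosine distance
        self.centroids = []         # one unit-norm mean embedding per cluster
        self.counts = []            # number of faces absorbed by each cluster

    def _assign(self, x, forbidden):
        """Assign one embedding to the nearest allowed cluster, or open a new one."""
        x = x / np.linalg.norm(x)
        best, best_dist = None, self.threshold
        for k, c in enumerate(self.centroids):
            if k in forbidden:  # cannot-link: cluster already used in this frame
                continue
            dist = 1.0 - float(np.dot(x, c))  # cosine distance of unit vectors
            if dist < best_dist:
                best, best_dist = k, dist
        if best is None:
            # No allowed cluster is close enough: create a new cluster.
            self.centroids.append(x)
            self.counts.append(1)
            return len(self.centroids) - 1
        # Update the running mean of the chosen cluster and renormalize it.
        n = self.counts[best]
        merged = (self.centroids[best] * n + x) / (n + 1)
        self.centroids[best] = merged / np.linalg.norm(merged)
        self.counts[best] = n + 1
        return best

    def process_segment(self, segment):
        """Cluster one segment.

        segment: list of frames; each frame is a list of embedding vectors.
        Returns cluster labels with the same nesting as the input.
        """
        labels = []
        for frame in segment:
            used = set()  # clusters already taken by faces in this frame
            frame_labels = []
            for x in frame:
                k = self._assign(np.asarray(x, dtype=float), used)
                used.add(k)
                frame_labels.append(k)
            labels.append(frame_labels)
        return labels
```

Calling `process_segment` once per segment, in temporal order, keeps only the centroids and counts in memory, so earlier segments never need to be revisited. This is the property that distinguishes an online method from an offline one, which requires the entire data at once.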
