Clustering based Contrastive Learning for Improving Face Representations

A good clustering algorithm can discover natural groupings in data. These groupings, if used wisely, provide a form of weak supervision for learning representations. In this work, we present Clustering-based Contrastive Learning (CCL), a representation learning approach that uses labels obtained from clustering, along with video constraints, to learn discriminative face features. We demonstrate our method on the challenging task of learning representations for video face clustering. Through several ablation studies, we analyze the impact of creating pair-wise positive and negative labels from different sources. Experiments on three challenging video face clustering datasets (BBT-0101, BF-0502, and ACCIO) show that CCL achieves a new state of the art on all of them.
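
To make the idea of pair-wise labels concrete, the following is a minimal sketch and not the authors' implementation. It assumes that video constraints take the common form of "faces in the same track are the same person" and "faces in tracks that co-occur in a frame are different people", and that all remaining pairs are labeled by cluster membership. The function names `make_pairs` and `contrastive_loss`, the `cooccurring_tracks` argument, and the margin value are hypothetical choices for illustration; the loss is the standard margin-based contrastive loss.

```python
import numpy as np

def make_pairs(cluster_ids, track_ids, cooccurring_tracks):
    """Build (i, j, y) pairs, with y = 1 for positive and y = 0 for negative.

    Assumed labeling rules (illustrative only):
      * same track              -> positive (video constraint)
      * tracks seen in a frame
        together                -> negative (video constraint)
      * otherwise               -> label taken from the clustering
    """
    pairs = []
    n = len(cluster_ids)
    for i in range(n):
        for j in range(i + 1, n):
            t_i, t_j = track_ids[i], track_ids[j]
            if t_i == t_j:
                y = 1
            elif (t_i, t_j) in cooccurring_tracks or (t_j, t_i) in cooccurring_tracks:
                y = 0
            else:
                y = int(cluster_ids[i] == cluster_ids[j])
            pairs.append((i, j, y))
    return pairs

def contrastive_loss(embeddings, pairs, margin=1.0):
    """Mean margin-based contrastive loss over the given pairs.

    Positive pairs are penalized by squared distance; negative pairs are
    penalized only when they are closer than `margin`.
    """
    total = 0.0
    for i, j, y in pairs:
        d = float(np.linalg.norm(embeddings[i] - embeddings[j]))
        total += y * d ** 2 + (1 - y) * max(0.0, margin - d) ** 2
    return total / len(pairs)

# Toy example: four face crops, three tracks, tracks 10 and 11 co-occur.
cluster_ids = [0, 0, 0, 1]
track_ids = [10, 10, 11, 12]
cooccurring = {(10, 11)}
pairs = make_pairs(cluster_ids, track_ids, cooccurring)
embeddings = np.array([[0.0, 0.0], [0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
loss = contrastive_loss(embeddings, pairs, margin=1.0)
```

In this toy example, crops 0 and 2 fall in the same cluster but belong to co-occurring tracks, so the video constraint overrides the cluster label and the pair is treated as a negative.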
