Face, Body, Voice: Video Person-Clustering with Multiple Modalities

The objective of this work is person-clustering in videos – grouping characters according to their identity. Previous methods focus on the narrower task of face-clustering, and for the most part ignore other cues such as the person’s voice, their overall appearance (hair, clothes, posture), and the editing structure of the videos. Similarly, most current datasets evaluate only the task of face-clustering, rather than person-clustering. This limits their applicability to downstream applications such as story understanding, which require person-level, rather than only face-level, reasoning.

In this paper we make contributions to address both these deficiencies. First, we introduce a Multi-Modal High-Precision Clustering algorithm for person-clustering in videos using cues from several modalities (face, body, and voice). Second, we introduce a Video Person-Clustering dataset for evaluating multi-modal person-clustering. It contains body-tracks for each annotated character, face-tracks when visible, and voice-tracks when speaking, with their associated features. The dataset is by far the largest of its kind, and covers films and TV-shows representing a wide range of demographics. Finally, we show the effectiveness of using multiple modalities for person-clustering, explore the use of this new broad task for story understanding through character co-occurrences, and achieve a new state of the art on all available datasets for face and person-clustering.
