Naming multi-modal clusters to identify persons in TV broadcast

Person identification in TV broadcast is one of the main tools for indexing this type of video. The classical approach relies on biometric face and speaker models, but covering a reasonable number of persons requires costly annotations. In recent years, several works have proposed using other sources of names to identify people, such as pronounced names and written names. The main idea is to group faces and speakers into clusters based on their similarities and to propagate these names onto the clusters. In this paper, we propose a method that takes advantage of written names during the diarization process, in order both to name clusters and to prevent the fusion of two clusters that carry different names. First, we extract written names with the LOOV tool (Poignant et al. 2012); these names are associated with their co-occurring speaker turns and face tracks. Simultaneously, we build a multi-modal matrix of distances between speaker turns and face tracks. Agglomerative clustering is then performed on this matrix, under the constraint that clusters associated with different names are never merged. We also integrate the predictions of a few biometric models (anchors, some journalists) to directly identify speaker turns and face tracks before the clustering process. Our approach was evaluated on the REPERE corpus and reached an F-measure of 68.2% for speaker identification and 60.2% for face identification. Adding a few biometric models improves these results to 82.4% and 65.6% for speaker and face identification, respectively. By comparison, a mono-modal, supervised person identification system with 706 speaker models trained on matching development data and additional TV and radio data achieves an F-measure of 67.8%, while 908 face models achieve only 30.5%.
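The name-constrained clustering step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the average-linkage criterion, the `stop_threshold` parameter, the function name `constrained_agglomerative`, and the toy distances are all assumptions made for illustration. The only element taken from the abstract is the constraint that two clusters carrying different names are never merged.

```python
import numpy as np


def constrained_agglomerative(dist, names, stop_threshold):
    """Greedy agglomerative clustering with a name constraint.

    dist: symmetric (n, n) array of distances between items
          (speaker turns and/or face tracks).
    names: list of length n; a written name (str) or None if unnamed.
    stop_threshold: hypothetical stopping criterion; merging stops once
          the closest allowed pair is farther apart than this value.
    Returns a list of (sorted member indices, cluster name or None).
    """
    clusters = {i: {"members": [i], "name": names[i]} for i in range(len(names))}
    while True:
        best = None
        ids = sorted(clusters)
        for pos, a in enumerate(ids):
            for b in ids[pos + 1:]:
                name_a, name_b = clusters[a]["name"], clusters[b]["name"]
                # Constraint: never merge clusters with different names.
                if name_a is not None and name_b is not None and name_a != name_b:
                    continue
                # Average linkage (an assumption, not taken from the paper).
                block = dist[np.ix_(clusters[a]["members"], clusters[b]["members"])]
                d = float(block.mean())
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best is None or best[0] > stop_threshold:
            break
        _, a, b = best
        clusters[a]["members"] += clusters[b]["members"]
        # The merged cluster inherits whichever name is available.
        clusters[a]["name"] = clusters[a]["name"] or clusters[b]["name"]
        del clusters[b]
    return [(sorted(c["members"]), c["name"]) for c in clusters.values()]


# Toy data: items 0 and 2 are close to each other but carry different names.
dist = np.array([
    [0.00, 0.10, 0.15, 0.90],
    [0.10, 0.00, 0.90, 0.90],
    [0.15, 0.90, 0.00, 0.20],
    [0.90, 0.90, 0.20, 0.00],
])
names = ["Alice", None, "Bob", None]
result = constrained_agglomerative(dist, names, stop_threshold=0.5)
print(result)
```

In this toy example, items 0 and 2 are closer than the stopping threshold, yet they stay in separate clusters because their names differ; each unnamed item joins its nearest cluster and inherits that cluster's name.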
