Comparison of two methods for unsupervised person identification in TV shows

We address the task of identifying people appearing in TV shows. The target persons are all people whose identity is said or written, such as journalists and well-known people (politicians, athletes, celebrities, etc.). In our approach, names overlaid on the images are used to identify the persons without any biometric models of the speakers or the faces. Two identification methods are evaluated as part of the REPERE French evaluation campaign. The first one relies on co-occurrence times between overlaid person names and speaker/face clusters, and on rule-based decisions that assign a name to each monomodal cluster. The second method uses a Conditional Random Field (CRF) that combines different types of co-occurrence statistics and pairwise constraints to jointly identify speakers and faces.
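To make the idea behind the first method concrete, here is a minimal sketch of naming clusters from co-occurrence durations. The abstract does not spell out the actual decision rules, so the rule used here (each cluster takes the overlaid name it co-occurs with for the longest total time) is an assumption, as are the function names `overlap` and `name_clusters` and the input layout.

```python
from collections import defaultdict


def overlap(a, b):
    """Return the duration (in seconds) shared by two (start, end) intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def name_clusters(segments, overlays):
    """Assign one overlaid name to each monomodal cluster.

    segments: list of (cluster_id, start, end) from speaker or face clustering.
    overlays: list of (name, start, end) for names detected on screen.
    Assumed rule (not taken from the paper): pick the name with the largest
    accumulated co-occurrence time; clusters with no overlap stay unnamed.
    """
    cooc = defaultdict(lambda: defaultdict(float))
    for cluster_id, seg_start, seg_end in segments:
        for name, ov_start, ov_end in overlays:
            shared = overlap((seg_start, seg_end), (ov_start, ov_end))
            if shared > 0:
                cooc[cluster_id][name] += shared
    return {cid: max(names, key=names.get) for cid, names in cooc.items()}


# Hypothetical toy data: two speaker clusters and three overlaid-name events.
segments = [("spk1", 0.0, 10.0), ("spk2", 10.0, 20.0), ("spk1", 20.0, 30.0)]
overlays = [
    ("Alice Martin", 2.0, 6.0),
    ("Bob Durand", 9.0, 14.0),
    ("Alice Martin", 22.0, 25.0),
]
print(name_clusters(segments, overlays))
```

A real system would likely add further rules, for example to resolve two clusters competing for the same name; the CRF-based second method instead combines several such statistics with pairwise constraints in a joint decision.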
