Multimodal person discovery in broadcast TV: lessons learned from MediaEval 2015

We describe the “Multimodal Person Discovery in Broadcast TV” task of the MediaEval 2015 benchmarking initiative. Participants were asked to return the names of people who can be both seen and heard in every shot of a collection of videos. The list of people was not known a priori, and their names had to be discovered in an unsupervised way from the media content, using text overlay or speech transcripts. The task was evaluated using information retrieval metrics, based on an a posteriori collaborative annotation of the test corpus. The first edition of the task gathered 9 teams, which submitted 34 runs. This paper provides quantitative and qualitative comparisons of the participants' submissions. We also investigate why all systems failed on particular shots, paving the way for promising future research directions.
