Unsupervised person recognition in TV broadcast (Identification non-supervisée de personnes dans les flux télévisés)

This thesis proposes several methods for the unsupervised identification of people appearing in TV broadcasts, using the names written on screen. Because biometric models are a poorly viable way to recognize people in large video collections without prior knowledge of who is to be identified, several state-of-the-art methods use other sources of information to obtain the names of the people present. These methods rely mainly on pronounced names as the source of names. However, this source can be trusted only weakly, both because of errors in transcription or name detection and because it is difficult to know to whom a pronounced name refers. Names written on screen in TV shows have been little used, because they were difficult to extract from low-quality video. In recent years, however, both video quality and the overlaying of text on screen have improved. In this thesis we therefore re-evaluated the use of this source of names. We first developed LOOV (for Lig Overlaid OCR in Video), a tool that extracts text overlaid on the image in videos. With this tool we obtain a very low character error rate, which gives us high confidence in this source of names. We then compared written names and pronounced names in their ability to provide the names of the people present in TV shows. The comparison showed that twice as many people can be named from written names as from automatically extracted pronounced names. Another important point is that associating a name with a person is inherently simpler for written names than for pronounced names.
This very good source of names allowed us to develop several methods for the unsupervised naming of people appearing in TV shows. We began with late naming methods, in which names are propagated onto speaker clusters. These methods call into question, to varying degrees, the choices made while grouping speech turns into speaker clusters. We then proposed two methods (integrated naming and early naming) that incorporate progressively more of the information from written names into the clustering process. To identify visible people, we adapted the early naming method to face clusters. Finally, we showed that this method also works for naming multimodal voice-face clusters. With this last method, which names speech turns and faces in a single process, we obtain results comparable to those of the best systems that competed in the first REPERE evaluation campaign.
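To make the idea of late naming more concrete, the following is a minimal sketch of how written names could be propagated onto speaker clusters that have already been built. It is an illustration under stated assumptions, not the method evaluated in the thesis: the function names (`overlap`, `late_naming`), the input tuples, and the rule "give each cluster the name that co-occurs with it for the longest total time" are all hypothetical choices made for this example.

```python
from collections import defaultdict


def overlap(a, b):
    """Return the duration (in seconds) shared by two (start, end) intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def late_naming(speech_turns, written_names):
    """Assign a name to each speaker cluster after clustering is finished.

    speech_turns:  list of (start, end, cluster_id) tuples (hypothetical format)
    written_names: list of (start, end, name) tuples (hypothetical format)

    Assumed rule: each cluster receives the written name with which it
    co-occurs for the longest total time; clusters that never co-occur
    with a written name stay unnamed (None).
    """
    cooccurrence = defaultdict(lambda: defaultdict(float))
    clusters = set()
    for t_start, t_end, cluster in speech_turns:
        clusters.add(cluster)
        for n_start, n_end, name in written_names:
            shared = overlap((t_start, t_end), (n_start, n_end))
            if shared > 0:
                cooccurrence[cluster][name] += shared

    naming = {}
    for cluster in clusters:
        scores = cooccurrence.get(cluster)
        naming[cluster] = max(scores, key=scores.get) if scores else None
    return naming


if __name__ == "__main__":
    # Toy data, invented for this example only.
    turns = [(0, 10, "spk1"), (10, 20, "spk2"), (20, 30, "spk1"), (30, 40, "spk3")]
    names = [(2, 6, "Alice Martin"), (12, 15, "Bruno Leroy"), (22, 25, "Alice Martin")]
    print(late_naming(turns, names))
```

In this sketch the clustering is never revised: a cluster that wrongly merges two speakers simply receives one name. The integrated and early naming methods described above differ precisely in that they bring the written names into the clustering process itself.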
