CRF-Based Context Modeling for Person Identification in Broadcast Videos

We investigate the problem of speaker and face identification in broadcast videos. Identification is performed by associating names automatically extracted from overlaid texts with speaker and face clusters. We aim to exploit the structure of news videos to resolve name/cluster association ambiguities and clustering errors. The proposed approach iteratively combines two Conditional Random Fields (CRFs). The first CRF performs person diarization (joint temporal segmentation, clustering, and association of voices and faces) over the speech segments and the face tracks together. It benefits from contextual information extracted from the image backgrounds and the overlaid texts. The second CRF associates names with person clusters using co-occurrence statistics. Experiments conducted on a recent, large public dataset containing reports and debates demonstrate the value and complementarity of the different modeling steps and information sources: using them improves both clustering and identification performance, especially in studio scenes.
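
To make the notion of co-occurrence statistics concrete, the following is a minimal sketch, not the CRF described above. It assumes that clusters are given as labeled time segments and that overlaid names are given as labeled time spans, and it swaps in a simple rule: each cluster receives the name whose overlays overlap it for the longest total time. The function names, the data layout, and the example names are all hypothetical.

```python
from collections import defaultdict


def overlap(a, b):
    """Length in seconds of the intersection of two (start, end) intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def cooccurrence(segments, overlays):
    """Total overlap time for each (cluster, name) pair.

    segments: list of (cluster_id, (start, end)), assumed diarization output
    overlays: list of (name, (start, end)), assumed overlaid-text output
    """
    stats = defaultdict(float)
    for cluster, seg in segments:
        for name, span in overlays:
            duration = overlap(seg, span)
            if duration > 0:
                stats[(cluster, name)] += duration
    return stats


def name_clusters(segments, overlays):
    """Give each cluster the name it co-occurs with longest (simple baseline)."""
    best = {}
    for (cluster, name), duration in cooccurrence(segments, overlays).items():
        if cluster not in best or duration > best[cluster][1]:
            best[cluster] = (name, duration)
    return {cluster: name for cluster, (name, _) in best.items()}


# Hypothetical toy input: two speaker clusters and two overlaid names.
segments = [("spk0", (0.0, 10.0)), ("spk1", (10.0, 25.0)), ("spk0", (25.0, 30.0))]
overlays = [("Alice Martin", (2.0, 6.0)), ("Bob Durand", (9.0, 14.0))]
names = name_clusters(segments, overlays)
```

Such a rule decides each cluster independently and leaves clusters that never overlap a name unnamed; a joint model over all clusters, as in the approach above, can use context to resolve these cases.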
