Unsupervised face identification in TV content using audio-visual sources

Our goal is to automatically identify faces in TV content without a pre-defined dictionary of identities. Most methods are based on identity detection (from OCR and ASR) and require a propagation strategy based on visual clustering. In TV content, people appear with many variations, which makes clustering very difficult. In this case, identifying speakers can provide a reliable link for identifying faces. In this work, we propose to combine reliable unsupervised face and speaker identification systems through talking-face detection in order to improve face identification results. First, OCR and ASR results are combined to extract identities locally. Then, reliable visual associations are used to propagate those identities locally. The reliably identified faces are used as unsupervised models to identify similar faces. Finally, speaker identities are propagated to faces when lip activity is detected. Experiments performed on the REPERE database show a +5% improvement in recall compared to the baseline, without degrading precision.
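The final step described above, propagating speaker identities to faces when lip activity is detected, could look roughly like the sketch below. It is only an illustration under stated assumptions, not the authors' implementation: the `FaceTrack` and `SpeakerTurn` types, the `lip_activity` flag, the temporal-overlap criterion, and the `min_overlap` threshold are all hypothetical, since the abstract does not specify how faces and speaker turns are matched.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FaceTrack:
    start: float                 # track start time, in seconds
    end: float                   # track end time, in seconds
    lip_activity: bool           # hypothetical output of a lip-activity detector
    name: Optional[str] = None   # identity from earlier steps, if any


@dataclass
class SpeakerTurn:
    start: float
    end: float
    name: Optional[str] = None   # identity from speaker identification, if any


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Duration (in seconds) shared by two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def propagate_speaker_names(
    faces: List[FaceTrack],
    turns: List[SpeakerTurn],
    min_overlap: float = 1.0,    # assumed threshold, not taken from the paper
) -> List[FaceTrack]:
    """Name unidentified talking faces after the best-overlapping named speaker."""
    for face in faces:
        # Keep identities found earlier; skip faces that show no lip activity.
        if face.name is not None or not face.lip_activity:
            continue
        best_name, best_overlap = None, 0.0
        for turn in turns:
            if turn.name is None:
                continue
            shared = overlap(face.start, face.end, turn.start, turn.end)
            if shared >= min_overlap and shared > best_overlap:
                best_name, best_overlap = turn.name, shared
        if best_name is not None:
            face.name = best_name
    return faces
```

In this sketch, faces that were already named in the earlier steps are never overwritten, which mirrors the ordering in the abstract where speaker-based propagation comes last.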
