Multimodal person discovery using label propagation over speaking faces graphs

Indexing large datasets is a task of great importance, since it directly affects the quality of the information that can be retrieved from them. Unfortunately, some datasets are growing so fast that manual indexing becomes infeasible. Automatic indexing techniques can be applied to overcome this issue. In this study, an unsupervised technique for multimodal person discovery is proposed, which consists of detecting persons who appear and speak simultaneously in a video and associating names with them. To achieve that, the data is modeled as a graph of speaking faces, and names are extracted via OCR and propagated through the graph based on audiovisual relations between speaking faces. To propagate labels, two graph-based methods are proposed, one based on random walks and the other based on a hierarchical approach. To assess the proposed approach, we use two graph clustering baselines and different modality fusion approaches. On the MediaEval MPD 2017 dataset, the proposed label propagation methods outperform all previously published methods except one, which uses a different approach in the pre-processing step. Even though the Kappa coefficient indicates that the random walk and the hierarchical label propagation produce highly equivalent results, the hierarchical propagation is more than 6 times faster than the random walk under the same configuration.
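The random-walk idea can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a generic random walk with restart, in which each node repeatedly averages the name scores of its neighbours (weighted by edge similarity) while seeded nodes keep pulling towards their OCR name. The function name `propagate_labels`, the parameters `alpha` and `iterations`, and the toy graph are all hypothetical.

```python
def propagate_labels(weights, seeds, alpha=0.8, iterations=50):
    """Random-walk-with-restart label propagation (illustrative sketch).

    weights: symmetric n x n list of lists of non-negative edge weights
             (assumed to encode audiovisual similarity between speaking faces).
    seeds:   dict mapping node index -> name (assumed to come from OCR).
    Returns a list with one name, or None, per node.
    """
    n = len(weights)
    names = sorted(set(seeds.values()))
    # Y[i][k] is 1.0 when node i is seeded with names[k], else 0.0.
    Y = [[1.0 if seeds.get(i) == name else 0.0 for name in names]
         for i in range(n)]
    # Row-normalise the weights into transition probabilities.
    P = []
    for row in weights:
        total = sum(row)
        P.append([w / total if total > 0 else 0.0 for w in row])
    # Iterate F <- alpha * P F + (1 - alpha) * Y.
    F = [row[:] for row in Y]
    for _ in range(iterations):
        F = [[alpha * sum(P[i][j] * F[j][k] for j in range(n))
              + (1 - alpha) * Y[i][k]
              for k in range(len(names))]
             for i in range(n)]
    # Each node takes its highest-scoring name; nodes with no score stay unnamed.
    result = []
    for scores in F:
        if names and max(scores) > 0:
            result.append(names[scores.index(max(scores))])
        else:
            result.append(None)
    return result


# Toy graph: nodes 0-2 are strongly similar, nodes 3-4 are strongly similar,
# and a weak edge links node 2 to node 3. Only nodes 0 and 4 have OCR names.
toy_weights = [
    [0.0, 0.9, 0.8, 0.0, 0.0],
    [0.9, 0.0, 0.9, 0.0, 0.0],
    [0.8, 0.9, 0.0, 0.1, 0.0],
    [0.0, 0.0, 0.1, 0.0, 0.9],
    [0.0, 0.0, 0.0, 0.9, 0.0],
]
toy_seeds = {0: "alice", 4: "bob"}
print(propagate_labels(toy_weights, toy_seeds))
```

In this toy setting the unnamed nodes inherit the name of the seed they are most strongly connected to, which is the behaviour the abstract describes at a high level. The restart weight `1 - alpha` controls how strongly seeded nodes hold on to their original name.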
