Person identification in multimedia streams

This paper describes a multimodal person recognition system for broadcast video, developed for participation in the REPERE challenge, which was organized jointly by the DGA and the ANR (French National Research Agency). The main track of this challenge targets the identification of all persons occurring in a video. The main scientific issue it addresses is how to combine audio and video information extraction processes so that extraction performance improves in both modalities. We present a strategy for speaker identification that enriches speaker diarization with features related to the "understanding" of video scenes: transcription and analysis of overlaid text, automatic identification of the situation (TV studio set or report), the number of people visible, the layout of the TV set, and even the camera, when available. Experiments on the REPERE corpus show the value of the proposed approach.
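As a rough illustration of how diarization output can be enriched with overlaid text, the sketch below names each speaker cluster after the overlaid name it overlaps with longest in time. This is not the system described in the paper: the function names, the tuple layouts, and the person names are all hypothetical, and the longest-overlap rule is only an assumed simplification of the idea.

```python
from collections import defaultdict


def overlap(a_start, a_end, b_start, b_end):
    """Return the length (in seconds) of the intersection of two intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def name_speaker_clusters(turns, overlays):
    """Assign each diarization cluster the overlaid name it co-occurs with most.

    turns:    list of (start, end, cluster_id) speaker turns (hypothetical format)
    overlays: list of (start, end, name) overlaid-text detections (hypothetical format)
    Clusters that never overlap a name are left out of the result.
    """
    # scores[cluster][name] = total seconds during which both are present
    scores = defaultdict(lambda: defaultdict(float))
    for t_start, t_end, cluster in turns:
        for o_start, o_end, name in overlays:
            shared = overlap(t_start, t_end, o_start, o_end)
            if shared > 0:
                scores[cluster][name] += shared
    # Keep, for each cluster, the name with the largest accumulated overlap.
    return {cluster: max(names, key=names.get) for cluster, names in scores.items()}


# Toy data: two speaker clusters and three overlaid-name detections.
turns = [(0, 10, "spk1"), (10, 25, "spk2"), (25, 40, "spk1")]
overlays = [(2, 6, "Alice Martin"), (12, 16, "Bob Durand"), (24, 26, "Bob Durand")]
labels = name_speaker_clusters(turns, overlays)
print(labels)
```

In this toy input, the last overlay straddles a speaker change, so it contributes one second to each cluster. The accumulated overlap still favours a single name per cluster. Scene features such as the number of visible people or the camera in use could be added as further weights on `shared`, but that weighting is left out here.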
