A conditional random field approach for audio-visual people diarization

We investigate the problem of audio-visual (AV) person diarization in broadcast data, that is, automatically associating the faces and voices of people and determining when they appear or speak in the video. The contributions are twofold. First, we formulate the problem within a novel conditional random field (CRF) framework that simultaneously performs the AV association of voices and face clusters to build AV person models, and the joint segmentation of the audio and visual streams using a set of AV cues and their association strength. Second, for this AV association strength we use a score that relies not only on lip activity but also on contextual visual information (face size, position, number of detected faces, etc.), which leads to more reliable association measures. Experiments on 6 hours of broadcast data show that our framework improves AV person diarization, especially for speaker segments erroneously labeled in the mono-modal case.
