OCR-aided person annotation and label propagation for speaker modeling in TV shows

In this paper, we present an approach for minimizing the human effort required for manual speaker annotation. The approach combines active learning with label propagation: at each iteration of the active learning cycle, a selection strategy chooses the most suitable speech track to be labeled, and the resulting human annotation is propagated to all the tracks in the corresponding cluster, which are gathered using agglomerative clustering. Four selection strategies are evaluated. To further reduce manual labor, an optical character recognition (OCR) system is used to bootstrap the annotations. At each step of the cycle, the annotations are used to build speaker models, and the quality of these models is evaluated using an i-vector based speaker identification system. The approach shows promising results on the REPERE corpus while requiring little human annotation effort.
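
The annotation cycle described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`cluster_tracks`, `annotate`), the toy one-dimensional features, the distance threshold, and the "largest unlabeled cluster first" selection rule are all assumptions made for illustration, and the selection rule is not necessarily one of the four strategies evaluated in the paper. Speaker model training and i-vector scoring are left out.

```python
from collections import Counter


def cluster_tracks(features, threshold):
    """Naive average-linkage agglomerative clustering (hypothetical helper).

    Merges the two closest clusters until the smallest average
    inter-cluster distance exceeds `threshold`.
    """
    clusters = [[i] for i in range(len(features))]

    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(features[a], features[b])) ** 0.5

    def linkage(c1, c2):
        return sum(dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        pairs = [
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        ]
        d, i, j = min(pairs)
        if d > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


def annotate(features, oracle, ocr_labels, threshold, budget):
    """Active annotation loop with label propagation (illustrative only)."""
    clusters = cluster_tracks(features, threshold)

    # Bootstrap: start from OCR-derived names and spread the majority
    # name of each cluster to its remaining tracks.
    labels = dict(ocr_labels)
    for cluster in clusters:
        seeds = [labels[t] for t in cluster if t in labels]
        if seeds:
            name = Counter(seeds).most_common(1)[0][0]
            for t in cluster:
                labels.setdefault(t, name)

    queries = 0
    while queries < budget:
        unlabeled = [c for c in clusters if not any(t in labels for t in c)]
        if not unlabeled:
            break
        # Assumed selection strategy: query the largest unlabeled cluster.
        target = max(unlabeled, key=len)
        name = oracle(target[0])  # one human annotation
        queries += 1
        for t in target:  # propagate to the whole cluster
            labels[t] = name
    return labels, queries


# Toy data: six speech tracks, three speakers, one name found by OCR.
features = [(0.0,), (0.1,), (0.2,), (5.0,), (5.1,), (9.0,)]
truth = ["A", "A", "A", "B", "B", "C"]
labels, queries = annotate(
    features, oracle=lambda t: truth[t], ocr_labels={3: "B"},
    threshold=1.0, budget=5,
)
print(labels, queries)
```

In this toy run the OCR seed labels one cluster without any human input, so only the two remaining clusters need a manual annotation. In a full system, the labels obtained after each query would feed speaker model training before the next iteration.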
