Unsupervised Speaker Identification in TV Broadcast Based on Written Names

Identifying speakers in TV broadcast in an unsupervised way (i.e., without biometric models) is a solution for avoiding costly annotations. Existing methods usually use pronounced names, as a source of names, for identifying speech clusters provided by a diarization step but this source is too imprecise for having sufficient confidence. To overcome this issue, another source of names can be used: the names written in a title block in the image track. We first compared these two sources of names on their abilities to provide the name of the speakers in TV broadcast. This study shows that it is more interesting to use written names for their high precision for identifying the current speaker. We also propose two approaches for finding speaker identity based only on names written in the image track. With the “late naming” approach, we propose different propagations of written names onto clusters. Our second proposition, “Early naming,” modifies the speaker diarization module (agglomerative clustering) by adding constraints preventing two clusters with different associated written names to be merged together. These methods were tested on the REPERE corpus phase 1, containing 3 hours of annotated videos. Our best “late naming” system reaches an F-measure of 73.1%. “early naming” improves over this result both in terms of identification error rate and of stability of the clustering stopping criterion. By comparison, a mono-modal, supervised speaker identification system with 535 speaker models trained on matching development data and additional TV and radio data only provided a 57.2% F-measure.

[1]  Claude Barras,et al.  On the use of GSV-SVM for Speaker Diarization and Tracking , 2010, Odyssey.

[2]  Jean-Luc Gauvain,et al.  Multistage speaker diarization of broadcast news , 2006, IEEE Transactions on Audio, Speech, and Language Processing.

[3]  Mickael Rouvier,et al.  A global optimization framework for speaker diarization , 2012, Odyssey.

[4]  Takeo Kanade,et al.  Name-It: Naming and Detecting Faces in News Videos , 1999, IEEE Multim..

[5]  Olivier Galibert,et al.  The REPERE Corpus : a multimodal corpus for person recognition , 2012, LREC.

[6]  Rong Yan,et al.  Multiple instance learning for labeling faces in broadcasting news video , 2005, MULTIMEDIA '05.

[7]  Sue Tranter Who Really Spoke When? Finding Speaker Turns and Identities in Broadcast News Audio , 2006, 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings.

[8]  Ricky Houghton Named Faces: Putting Names to Faces , 1999, IEEE Intell. Syst..

[9]  Jean-Luc Gauvain,et al.  The LIMSI Broadcast News transcription system , 2002, Speech Commun..

[10]  Georges Quénot,et al.  QCompere @ REPERE 2013 , 2013, SLAM@INTERSPEECH.

[11]  Jean-Luc Gauvain,et al.  Speaker diarization from speech transcripts , 2004, INTERSPEECH.

[12]  Patrick Kenny,et al.  Front-End Factor Analysis for Speaker Verification , 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[13]  Sylvain Meignier,et al.  Automatic named identification of speakers using diarization and ASR systems , 2009, 2009 IEEE International Conference on Acoustics, Speech and Signal Processing.

[14]  Paul Deléglise,et al.  Extracting true speaker identities from transcriptions , 2007, INTERSPEECH.

[15]  Georges Quénot,et al.  Unsupervised Speaker Identification using Overlaid Texts in TV Broadcast , 2012, INTERSPEECH.

[16]  Alexandre Allauzen,et al.  Training and Evaluation of POS Taggers on the French MULTITAG Corpus , 2008, LREC.

[17]  Hervé Bredin,et al.  Integer linear programming for speaker diarization and cross-modal identification in TV broadcast , 2013, INTERSPEECH.

[18]  Patrick Nguyen,et al.  Finding Speaker Identities with a Conditional Maximum Entropy Model , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[19]  Georges Quénot,et al.  From Text Detection in Videos to Person Identification , 2012, 2012 IEEE International Conference on Multimedia and Expo.

[20]  Sylvain Meignier,et al.  Identification of Speakers by Name Using Belief Functions , 2010, IPMU.

[21]  Georges Quénot,et al.  Nommage non-supervisé des personnes dans les émissions de télévision : une revue du potentiel de chaque modalité , 2014, CORIA.

[22]  Jun Yang,et al.  Naming every individual in news video monologues , 2004, MULTIMEDIA '04.

[23]  Jean-Luc Gauvain,et al.  Partitioning and transcription of broadcast news data , 1998, ICSLP.

[24]  Georges Quénot,et al.  Fusion of Speech, Faces and Text for Person Identification in TV Broadcast , 2012, ECCV Workshops.

[25]  Ngoc Thang Vu,et al.  Speech recognition for machine translation in Quaero , 2011, IWSLT.

[26]  Julie Mauclair,et al.  Speaker Diarization: About whom the Speaker is Talking ? , 2006, 2006 IEEE Odyssey - The Speaker and Language Recognition Workshop.

[27]  Olivier Galibert,et al.  The LIMSI Participation in the QAst 2009 Track: Experimenting on Answer Scoring , 2009, CLEF.

[28]  Takeo Kanade,et al.  Video OCR: indexing digital news libraries by recognition of superimposed captions , 1999, Multimedia Systems.

[29]  L. Lamel,et al.  A comparative study using manual and automatic transcriptions for diarization , 2005, IEEE Workshop on Automatic Speech Recognition and Understanding, 2005..

[30]  S. Chen,et al.  Speaker, Environment and Channel Change Detection and Clustering via the Bayesian Information Criterion , 1998 .

[31]  Georges Quénot,et al.  Towards a Better Integration of Written Names for Unsupervised Speakers Identification in Videos , 2013, SLAM@INTERSPEECH.

[32]  Olivier Galibert,et al.  A presentation of the REPERE challenge , 2012, 2012 10th International Workshop on Content-Based Multimedia Indexing (CBMI).

[33]  Georges Quénot,et al.  Unsupervised naming of speakers in broadcast TV: using written names, pronounced names or both? , 2013, INTERSPEECH.

[34]  Frédéric Béchet,et al.  Detecting person presence in TV shows with linguistic and structural features , 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[35]  Sophie Rosset,et al.  Models Cascade for Tree-Structured Named Entity Detection , 2011, IJCNLP.

[36]  Harold W. Kuhn,et al.  The Hungarian method for the assignment problem , 1955, 50 Years of Integer Programming.

[37]  Elie el Khoury,et al.  Combining transcription-based and acoustic-based speaker identifications for broadcast news , 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[38]  Stéphane Ayache,et al.  Speaker Identity Indexing In Audio-Visual Documents , 2005 .

[39]  Douglas E. Sturim,et al.  Support vector machines using GMM supervectors for speaker verification , 2006, IEEE Signal Processing Letters.