Integer linear programming for speaker diarization and cross-modal identification in TV broadcast

Most state-of-the-art approaches address speaker diariza- tion as a hierarchical agglomerative clustering problem in the audio domain. In this paper, we propose to revisit one of them: speech turns clustering based on the Bayesian Information Cri- terion (a.k.a. BIC clustering). First, we show how to model it as an integer linear programming (ILP) problem. Its resolu- tion leads to the same overall diarization error rate as standard BIC clustering but generates significantly purer speaker clus- ters. Then, we describe how this approach can easily be ex- tended to the audiovisual domain and TV broadcast in particu- lar. The straightforward integration of detected overlaid names (used to introduce guests or journalists, and obtained via video OCR) into a multimodal ILP problem yields significantly better speaker diarization results. Finally, we explain how this novel paradigm can incidentally be used for unsupervised speaker identification (i.e. not relying on any prior acoustic speaker models). Experiments on the REPERE TV broadcast corpus show that it achieves performance close to that of an oracle ca- pable of identifying any speaker as long as their name appears on screen at least once in the video.

[1]  Douglas A. Reynolds,et al.  A study of new approaches to speaker diarization , 2009, INTERSPEECH.

[2]  M E J Newman,et al.  Modularity and community structure in networks. , 2006, Proceedings of the National Academy of Sciences of the United States of America.

[3]  Anil K. Jain,et al.  Data clustering: a review , 1999, CSUR.

[4]  Nicholas W. D. Evans,et al.  Speaker Diarization: A Review of Recent Research , 2010, IEEE Transactions on Audio, Speech, and Language Processing.

[5]  Sylvain Meignier,et al.  Automatic named identification of speakers using diarization and ASR systems , 2009, 2009 IEEE International Conference on Acoustics, Speech and Signal Processing.

[6]  Jordi Luque,et al.  On the use of agglomerative and spectral clustering in speaker diarization of meetings , 2012, Odyssey.

[7]  Douglas A. Reynolds,et al.  An overview of automatic speaker diarization systems , 2006, IEEE Transactions on Audio, Speech, and Language Processing.

[8]  Gwenn Englebienne,et al.  Multimodal Speaker Diarization , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[9]  H Hermansky,et al.  Perceptual linear predictive (PLP) analysis of speech. , 1990, The Journal of the Acoustical Society of America.

[10]  A. Land,et al.  An Automatic Method for Solving Discrete Programming Problems , 1960, 50 Years of Integer Programming.

[11]  Jean-Luc Gauvain,et al.  Multistage speaker diarization of broadcast news , 2006, IEEE Transactions on Audio, Speech, and Language Processing.

[12]  Olivier Galibert,et al.  The REPERE Corpus : a multimodal corpus for person recognition , 2012, LREC.

[13]  Christopher D. Manning,et al.  Enforcing Transitivity in Coreference Resolution , 2008, ACL.

[14]  Michael I. Jordan,et al.  On Spectral Clustering: Analysis and an algorithm , 2001, NIPS.

[15]  Rainer Stiefelhagen,et al.  “Knock! Knock! Who is it?” probabilistic person identification in TV-series , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[16]  Francisco Javier,et al.  On the use of agglomerative and spectral clustering in speaker diarization of meetings , 2012 .

[17]  Georges Quénot,et al.  From Text Detection in Videos to Person Identification , 2012, 2012 IEEE International Conference on Multimedia and Expo.

[18]  Mickael Rouvier,et al.  I-vectors and ILP clustering adapted to cross-show speaker diarization , 2012, INTERSPEECH.

[19]  Tomi Kinnunen,et al.  INTERSPEECH 2013 14thAnnual Conference of the International Speech Communication Association , 2013, Interspeech 2015.

[20]  James R. Glass,et al.  On the Use of Spectral and Iterative Methods for Speaker Diarization , 2012, INTERSPEECH.

[21]  Patrick Kenny,et al.  Front-End Factor Analysis for Speaker Verification , 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[22]  Dong Wang,et al.  A Comparative Study of Bottom-Up and Top-Down Approaches to Speaker Diarization , 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[23]  Georges Quénot,et al.  Unsupervised Speaker Identification using Overlaid Texts in TV Broadcast , 2012, INTERSPEECH.

[24]  Thomas S. Huang,et al.  A spectral clustering approach to speaker diarization , 2006, INTERSPEECH.

[25]  R. Smith,et al.  An Overview of the Tesseract OCR Engine , 2007, Ninth International Conference on Document Analysis and Recognition (ICDAR 2007).

[26]  S. Chen,et al.  Speaker, Environment and Channel Change Detection and Clustering via the Bayesian Information Criterion , 1998 .