D{é}tection de locuteurs dans les s{é}ries TV

Speaker diarization of audio streams turns out to be particularly challenging when applied to fictional films, where many characters talk in various acoustic conditions (background music, sound effects, variations in intonation...). Despite this acoustic variability, such movies exhibit specific visual patterns, particularly within dialogue scenes. In this paper, we introduce a two-step method to achieve speaker diarization in TV series: speaker diarization is first performed locally within scenes visually identified as dialogues; then, the hypothesized local speakers are compared to each other during a second clustering process in order to detect recurring speakers: this second stage of clustering is subject to the constraint that the different speakers involved in the same dialogue have to be assigned to different clusters. The performances of our approach are compared to those obtained by standard speaker diarization tools applied to the same data.

[1]  S. S. Ravi,et al.  Using instance-level constraints in agglomerative hierarchical clustering: theoretical and empirical results , 2009, Data Mining and Knowledge Discovery.

[2]  Chuohao Yeo,et al.  Multi-modal speaker diarization of real-world meetings using compressed-domain video features , 2009, 2009 IEEE International Conference on Acoustics, Speech and Signal Processing.

[3]  Dong Wang,et al.  A Comparative Study of Bottom-Up and Top-Down Approaches to Speaker Diarization , 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[4]  Thierry Bazillon,et al.  Speaker diarization of heterogeneous web video files: A preliminary study , 2011, 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[5]  Nicu Sebe,et al.  Content-based multimedia information retrieval: State of the art and challenges , 2006, TOMCCAP.

[6]  Tien Ping Tan,et al.  Recherche par le contenu dans des documents audiovisuels multilingues , 2009, CORIA.

[7]  Lori Lamel,et al.  Comparing Multi-Stage Approaches for Cross-Show Speaker Diarization , 2011, INTERSPEECH.

[8]  Hervé Bredin,et al.  Segmentation of TV shows into scenes using speaker diarization and speech recognition , 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[9]  Jean Carletta,et al.  The AMI Meeting Corpus: A Pre-announcement , 2005, MLMI.

[10]  Delphine Charlet,et al.  Unsupervised face identification in TV content using audio-visual sources , 2013, 2013 11th International Workshop on Content-Based Multimedia Indexing (CBMI).

[11]  Philippe Ercolessi Extraction multimodale de la structure narrative des épisodes de séries télévisées , 2013 .

[12]  Driss Matrouf,et al.  Intersession Compensation and Scoring Methods in the i-vectors Space for Speaker Recognition , 2011, INTERSPEECH.

[13]  John S. Boreczky,et al.  Comparison of video shot boundary detection techniques , 1996, Electronic Imaging.

[14]  John R. Smith,et al.  Semantic Indexing of Multimedia Content Using Visual, Audio, and Text Cues , 2003, EURASIP J. Adv. Signal Process..

[15]  P. Rousseeuw Silhouettes: a graphical aid to the interpretation and validation of cluster analysis , 1987 .

[16]  Mickael Rouvier,et al.  An open-source state-of-the-art toolbox for broadcast news diarization , 2013, INTERSPEECH.