Improving Speaker Diarization of TV Series using Talking-Face Detection and Clustering

While successful on broadcast news, meetings or telephone conversation, state-of-the-art speaker diarization techniques tend to perform poorly on TV series or movies. In this paper, we propose to rely on state-of-the-art face clustering techniques to guide acoustic speaker diarization. Two approaches are tested and evaluated on the first season of Game Of Thrones TV series. The second (better) approach relies on a novel talking-face detection module based on bi-directional long short-term memory recurrent neural network. Both audio-visual approaches outperform the audio-only baseline. A detailed study of the behavior of these approaches is also provided and paves the way to future improvements.

[1]  Davis E. King,et al.  Dlib-ml: A Machine Learning Toolkit , 2009, J. Mach. Learn. Res..

[2]  Rainer Stiefelhagen,et al.  Improved weak labels using contextual cues for person identification in videos , 2015, 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG).

[3]  Andrew Zisserman,et al.  Hello! My name is... Buffy'' -- Automatic Naming of Characters in TV Video , 2006, BMVC.

[4]  Paul J. Werbos,et al.  Backpropagation Through Time: What It Does and How to Do It , 1990, Proc. IEEE.

[5]  Josephine Sullivan,et al.  One millisecond face alignment with an ensemble of regression trees , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[6]  Nicholas W. D. Evans,et al.  Speaker Diarization: A Review of Recent Research , 2010, IEEE Transactions on Audio, Speech, and Language Processing.

[7]  Thomas Fillon,et al.  YAAFE, an Easy to Use and Efficient Audio Feature Extraction Software , 2010, ISMIR.

[8]  Mahadev Satyanarayanan,et al.  OpenFace: A general-purpose face recognition library with mobile applications , 2016 .

[9]  Jean-Luc Gauvain,et al.  Minimum word error training of RNN-based voice activity detection , 2015, INTERSPEECH.

[10]  Georges Linarès,et al.  Constrained speaker diarization of TV series based on visual patterns , 2014, 2014 IEEE Spoken Language Technology Workshop (SLT).

[11]  Michael Felsberg,et al.  Accurate Scale Estimation for Robust Visual Tracking , 2014, BMVC.

[12]  Camille Guinaudeau,et al.  TVD: A Reproducible and Multiply Aligned TV Series Dataset , 2014, LREC.

[13]  David D. Cox,et al.  Hyperopt: A Python Library for Optimizing the Hyperparameters of Machine Learning Algorithms , 2013, SciPy.

[14]  Bill Triggs,et al.  Histograms of oriented gradients for human detection , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[15]  Thierry Bazillon,et al.  Speaker diarization of heterogeneous web video files: A preliminary study , 2011, 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[16]  S. Chen,et al.  Speaker, Environment and Channel Change Detection and Clustering via the Bayesian Information Criterion , 1998 .

[17]  Sylvain Meignier,et al.  Automatic named identification of speakers using diarization and ASR systems , 2009, 2009 IEEE International Conference on Acoustics, Speech and Signal Processing.

[18]  Andrew Zisserman,et al.  Taking the bite out of automated naming of characters in TV video , 2009, Image Vis. Comput..

[19]  Xiaojun Wu,et al.  Convergence analysis and improvements of quantum-behaved particle swarm optimization , 2012, Inf. Sci..

[20]  David D. Cox,et al.  Making a Science of Model Search: Hyperparameter Optimization in Hundreds of Dimensions for Vision Architectures , 2013, ICML.

[21]  Alexandre Allauzen,et al.  "Sheldon speaking, Bonjour!": Leveraging Multilingual Tracks for (Weakly) Supervised Speaker Identification , 2014, ACM Multimedia.

[22]  James Philbin,et al.  FaceNet: A unified embedding for face recognition and clustering , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  Rainer Stiefelhagen,et al.  Book2Movie: Aligning video scenes with book chapters , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  William J. Christmas,et al.  A Study on Automatic Shot Change Detection , 1998, ECMAST.