Spot the conversation: speaker diarisation in the wild

The goal of this paper is speaker diarisation of videos collected 'in the wild'. We make three key contributions. First, we propose an automatic audio-visual diarisation method for YouTube videos. Our method consists of active speaker detection using audio-visual methods and speaker verification using self-enrolled speaker models. Second, we integrate our method into a semi-automatic dataset creation pipeline which significantly reduces the number of hours required to annotate videos with diarisation labels. Finally, we use this pipeline to create a large-scale diarisation dataset called VoxConverse, collected from 'in the wild' videos, which we will release publicly to the research community. Our dataset consists of overlapping speech, a large and diverse speaker pool, and challenging background conditions.

[1]  Li Fei-Fei,et al.  ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[2]  H. Edelsbrunner,et al.  Efficient algorithms for agglomerative hierarchical clustering methods , 1984 .

[3]  James Philbin,et al.  FaceNet: A unified embedding for face recognition and clustering , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Jason W. Pelecanos,et al.  Online speaker diarization using adapted i-vector transforms , 2016, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[5]  Daniel C. Burnett,et al.  WebRTC: APIs and RTCWEB Protocols of the HTML5 Real-Time Web , 2012 .

[6]  Joon Son Chung,et al.  In defence of metric learning for speaker recognition , 2020, INTERSPEECH.

[7]  Jean Carletta,et al.  The AMI Meeting Corpus: A Pre-announcement , 2005, MLMI.

[8]  Daniel Garcia-Romero,et al.  Speaker diarization with plda i-vector scoring and unsupervised calibration , 2014, 2014 IEEE Spoken Language Technology Workshop (SLT).

[9]  Shinji Watanabe,et al.  Diarization is Hard: Some Experiences and Lessons Learned for the JHU Team in the Inaugural DIHARD Challenge , 2018, INTERSPEECH.

[10]  Sanjeev Khudanpur,et al.  X-Vectors: Robust DNN Embeddings for Speaker Recognition , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[11]  Hervé Bourlard,et al.  Improved overlap speech diarization of meeting recordings using long-term conversational features , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[12]  Abhishek Dutta,et al.  The VIA Annotation Software for Images, Audio and Video , 2019, ACM Multimedia.

[13]  Joon Son Chung,et al.  Voxceleb: Large-scale speaker verification in the wild , 2020, Comput. Speech Lang..

[14]  Joon Son Chung,et al.  Utterance-level Aggregation for Speaker Recognition in the Wild , 2019, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[15]  Quan Wang,et al.  Fully Supervised Speaker Diarization , 2018, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[16]  Andrew Zisserman,et al.  Deep Face Recognition , 2015, BMVC.

[17]  Joon Son Chung,et al.  Out of Time: Automated Lip Sync in the Wild , 2016, ACCV Workshops.

[18]  Kenneth Ward Church,et al.  The Second DIHARD Diarization Challenge: Dataset, task, and baselines , 2019, INTERSPEECH.

[19]  Joon Son Chung,et al.  VoxCeleb: A Large-Scale Speaker Identification Dataset , 2017, INTERSPEECH.

[20]  Jun Du,et al.  Speaker Diarization with Enhancing Speech for the First DIHARD Challenge , 2018, INTERSPEECH.

[21]  Shifeng Zhang,et al.  S^3FD: Single Shot Scale-Invariant Face Detector , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[22]  Alan McCree,et al.  Speaker diarization using deep neural network embeddings , 2017, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[23]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  Andreas Stolcke,et al.  The ICSI Meeting Corpus , 2003, 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03)..

[25]  Joon Son Chung,et al.  My lips are concealed: Audio-visual speech enhancement through obstructions , 2019, INTERSPEECH.

[26]  Joon Son Chung,et al.  The Conversation: Deep Audio-Visual Speech Enhancement , 2018, INTERSPEECH.

[27]  James H. Martin,et al.  Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition , 2000 .

[28]  Joon Son Chung,et al.  Perfect Match: Improved Cross-modal Embeddings for Audio-visual Synchronisation , 2018, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[29]  Kevin Wilson,et al.  Looking to listen at the cocktail party , 2018, ACM Trans. Graph..

[30]  Joon Son Chung,et al.  Lip Reading in Profile , 2017, BMVC.

[31]  Abhishek Dutta,et al.  The VGG Image Annotator (VIA) , 2019, ArXiv.

[32]  Douglas A. Reynolds,et al.  Speaker diarisation for broadcast news , 2004, Odyssey.

[33]  Joon Son Chung,et al.  VoxCeleb2: Deep Speaker Recognition , 2018, INTERSPEECH.

[34]  Joon Son Chung,et al.  Who said that?: Audio-visual speaker diarisation of real-world meetings , 2019, INTERSPEECH.

[35]  Andrew Zisserman,et al.  Vggsound: A Large-Scale Audio-Visual Dataset , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[36]  Sergey Ioffe,et al.  Probabilistic Linear Discriminant Analysis , 2006, ECCV.

[37]  Themos Stafylakis,et al.  PLDA for speaker verification with utterances of arbitrary duration , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.