VoxSRC 2020: The Second VoxCeleb Speaker Recognition Challenge

We held the second installment of the VoxCeleb Speaker Recognition Challenge in conjunction with Interspeech 2020. The goal of this challenge was to assess how well current speaker recognition technology is able to diarise and recognize speakers in unconstrained or `in the wild' data. It consisted of: (i) a publicly available speaker recognition and diarisation dataset from YouTube videos together with ground truth annotation and standardised evaluation software; and (ii) a virtual public challenge and workshop held at Interspeech 2020. This paper outlines the challenge, and describes the baselines, methods used, and results. We conclude with a discussion of the progress over the first installment of the challenge.

[1]  Xu Xiang,et al.  The xx205 System for the VoxCeleb Speaker Recognition Challenge 2020 , 2020, ArXiv.

[2]  Kaiming He,et al.  Momentum Contrast for Unsupervised Visual Representation Learning , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Zhuowen Tu,et al.  Aggregated Residual Transformations for Deep Neural Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Stefanos Zafeiriou,et al.  Sub-center ArcFace: Boosting Face Recognition by Large-Scale Noisy Web Faces , 2020, ECCV.

[5]  Kyunghyun Cho,et al.  A Framework For Contrastive Self-Supervised Learning And Designing A New Approach , 2020, ArXiv.

[6]  Shinji Watanabe,et al.  Diarization is Hard: Some Experiences and Lessons Learned for the JHU Team in the Inaugural DIHARD Challenge , 2018, INTERSPEECH.

[7]  Shuicheng Yan,et al.  Dual Path Networks , 2017, NIPS.

[8]  Hossein Sameti,et al.  DeepMine Speech Processing Database: Text-Dependent and Independent Speaker Verification and Speech Recognition in Persian and English , 2018, Odyssey.

[9]  Jean Carletta,et al.  Unleashing the killer corpus: experiences in creating the multi-everything AMI Meeting Corpus , 2007, Lang. Resour. Evaluation.

[10]  Joon Son Chung,et al.  Out of Time: Automated Lip Sync in the Wild , 2016, ACCV Workshops.

[11]  Pavel Korshunov,et al.  Pyannote.Audio: Neural Building Blocks for Speaker Diarization , 2019, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[12]  Shinji Watanabe,et al.  Augmentation adversarial training for unsupervised speaker recognition , 2020, ArXiv.

[13]  Mireia Díez,et al.  Analysis of the BUT Diarization System for VoxConverse Challenge , 2020, ArXiv.

[14]  Weiqing Wang,et al.  The DKU-DukeECE Systems for VoxCeleb Speaker Recognition Challenge 2020. , 2020, 2010.12731.

[15]  Kenneth Ward Church,et al.  The Second DIHARD Diarization Challenge: Dataset, task, and baselines , 2019, INTERSPEECH.

[16]  Joon Son Chung,et al.  Spot the conversation: speaker diarisation in the wild , 2020, INTERSPEECH.

[17]  Joon Son Chung,et al.  VoxSRC 2019: The first VoxCeleb Speaker Recognition Challenge , 2019, ArXiv.

[18]  Joon Son Chung,et al.  VoxCeleb2: Deep Speaker Recognition , 2018, INTERSPEECH.

[19]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Muhammad Umair Ahmed Khan,et al.  The UPC Speaker Verification System Submitted to VoxCeleb Speaker Recognition Challenge 2020 (VoxSRC-20) , 2020, ArXiv.

[21]  Joon Son Chung,et al.  Playing a Part: Speaker Verification at the Movies , 2020, ArXiv.

[22]  Seyed Omid Sadjadi,et al.  The 2019 NIST Speaker Recognition Evaluation CTS Challenge , 2020, Odyssey.

[23]  Nakamasa Inoue,et al.  Semi-Supervised Contrastive Learning with Generalized Contrastive Loss and Its Application to Speaker Recognition , 2020, 2020 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC).

[24]  Naoyuki Kanda,et al.  Microsoft Speaker Diarization System for the Voxceleb Speaker Recognition Challenge 2020 , 2020, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[25]  Kenneth Ward Church,et al.  Third DIHARD Challenge Evaluation Plan , 2020, ArXiv.

[26]  Andreas Stolcke,et al.  Dover: A Method for Combining Diarization Outputs , 2019, 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[27]  Vincent M. Stanford,et al.  The 2021 NIST Speaker Recognition Evaluation , 2022, Odyssey.

[28]  Joon Son Chung,et al.  Voxceleb: Large-scale speaker verification in the wild , 2020, Comput. Speech Lang..

[29]  Lukás Burget,et al.  Self-supervised speaker embeddings , 2019, INTERSPEECH.

[30]  Geoffrey E. Hinton,et al.  A Simple Framework for Contrastive Learning of Visual Representations , 2020, ICML.

[31]  Joon Son Chung,et al.  Disentangled Speech Embeddings Using Cross-Modal Self-Supervision , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[32]  Andrew Zisserman,et al.  Condensed Movies: Story Based Retrieval with Contextual Embeddings , 2020, ACCV.

[33]  Joon Son Chung,et al.  The Conversation: Deep Audio-Visual Speech Enhancement , 2018, INTERSPEECH.

[34]  Xiang Xu The xx205 System for the VoxCeleb Speaker Recognition Challenge 2020. , 2020 .

[35]  Yu Zhang,et al.  Conformer: Convolution-augmented Transformer for Speech Recognition , 2020, INTERSPEECH.

[36]  Joon Son Chung,et al.  In defence of metric learning for speaker recognition , 2020, INTERSPEECH.

[37]  Kai Zhao,et al.  Res2Net: A New Multi-Scale Backbone Architecture , 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[38]  Shrikanth Narayanan,et al.  The INTERSPEECH 2020 Far-Field Speaker Verification Challenge , 2020, INTERSPEECH.

[39]  Daniel Povey,et al.  MUSAN: A Music, Speech, and Noise Corpus , 2015, ArXiv.

[40]  Jian Cheng,et al.  Additive Margin Softmax for Face Verification , 2018, IEEE Signal Processing Letters.

[41]  Colleen Richey,et al.  The VOiCES from a Distance Challenge 2019 Evaluation Plan , 2019, ArXiv.

[42]  Jenthe Thienpondt,et al.  The IDLAB VoxCeleb Speaker Recognition Challenge 2020 System Description , 2020, ArXiv.

[43]  Sanjeev Khudanpur,et al.  A study on data augmentation of reverberant speech for robust speech recognition , 2017, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[44]  Jenthe Thienpondt,et al.  ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification , 2020, INTERSPEECH.

[45]  Douglas A. Reynolds,et al.  The 2018 NIST Speaker Recognition Evaluation , 2019, INTERSPEECH.

[46]  Shuai Wang,et al.  BUT System Description to VoxCeleb Speaker Recognition Challenge 2019 , 2019, ArXiv.

[47]  Joon Son Chung,et al.  VoxCeleb: A Large-Scale Speaker Identification Dataset , 2017, INTERSPEECH.

[48]  Andrew Zisserman,et al.  Vggsound: A Large-Scale Audio-Visual Dataset , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[49]  Sanjeev Khudanpur,et al.  Librispeech: An ASR corpus based on public domain audio books , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[50]  Sungroh Yoon,et al.  Momentum Contrast Speaker Representation Learning , 2020, ArXiv.

[51]  Jon Barker,et al.  CHiME-6 Challenge: Tackling Multispeaker Speech Recognition for Unsegmented Recordings , 2020, 6th International Workshop on Speech Processing in Everyday Environments (CHiME 2020).