The IDLAB VoxCeleb Speaker Recognition Challenge 2020 System Description

In this technical report we describe the IDLAB top-scoring submissions for the VoxCeleb Speaker Recognition Challenge 2020 (VoxSRC-20) in the supervised and unsupervised speaker verification tracks. For the supervised verification tracks we trained six state-of-the-art ECAPA-TDNN systems and four ResNet34-based systems with architectural variations. To all models we apply a large margin fine-tuning strategy, which enables the training procedure to use higher margin penalties by switching to longer training utterances. In addition, we use quality-aware score calibration, which introduces quality metrics into the calibration system to produce more consistent scores across varying utterance conditions. A fusion of all systems with both enhancements applied led to first place on the open and closed supervised verification tracks. The unsupervised system is trained through contrastive learning. Subsequent pseudo-label generation by iterative clustering of the training embeddings allows the use of supervised techniques. This procedure led to the winning submission on the unsupervised track, and its performance is closing the gap with supervised training.
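The large margin fine-tuning strategy raises the angular margin penalty of the classification loss (an AAM-softmax-style layer) once training switches to longer utterances. A minimal NumPy sketch of such a margin layer is shown below; the function and parameter names are illustrative and the exact margin and scale values used in the submissions are assumptions, not taken from this report.

```python
import numpy as np

def aam_softmax_logits(embeddings, weights, labels, margin=0.2, scale=30.0):
    """Additive angular margin (AAM) softmax logits (illustrative sketch).

    Large margin fine-tuning would raise `margin` (e.g. 0.2 -> 0.5, values
    assumed for illustration) while training on longer utterances, making
    the target-speaker class harder and the embeddings more discriminative.
    """
    # L2-normalise embeddings and class weights so dot products are cosines
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    cos = e @ w.T                                  # (batch, n_speakers)
    theta = np.arccos(np.clip(cos, -1.0, 1.0))     # angles to class centres

    # Apply the angular margin only to each utterance's target speaker
    target = np.zeros_like(cos, dtype=bool)
    target[np.arange(len(labels)), labels] = True
    cos_with_margin = np.where(target, np.cos(theta + margin), cos)

    # Scaled logits, to be fed into a standard softmax cross-entropy loss
    return scale * cos_with_margin
```

Because the margin is added to the angle of the target class only, its logit is pushed down relative to the non-target logits, so the network must separate speakers by at least that angular margin to keep the loss low.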