Contrastive Self-Supervised Learning for Text-Independent Speaker Verification

Current speaker verification models rely on supervised training with large amounts of annotated data, but collecting labeled utterances from many speakers is expensive and raises privacy concerns. To open up an opportunity for utilizing massive unlabeled utterance data, our work applies a contrastive self-supervised learning (CSSL) approach to the text-independent speaker verification task. The core principle of CSSL is to minimize the distance between the embeddings of augmented segments truncated from the same utterance while maximizing the distance between embeddings from different utterances. We further propose a channel-invariant loss to prevent the network from encoding undesired channel information into the speaker representation. With these objectives, we conduct extensive experiments on the VoxCeleb1&2 datasets. The self-supervised thin-ResNet34, fine-tuned with only 5% of the labeled data, achieves performance comparable to the fully supervised model, substantially reducing the amount of manual annotation required.
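The core CSSL principle described above can be sketched as an NT-Xent-style contrastive objective over a batch of paired segment embeddings. This is a minimal NumPy illustration under assumptions of my own (function name, temperature value, and cosine-similarity normalization are illustrative), not the paper's exact formulation:

```python
import numpy as np

def contrastive_loss(z1, z2, temperature=0.1):
    """NT-Xent-style loss: z1[i] and z2[i] are embeddings of two augmented
    segments truncated from the same utterance (a positive pair); every
    other embedding in the batch serves as a negative."""
    # L2-normalize so dot products become cosine similarities
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    z = np.concatenate([z1, z2], axis=0)           # (2N, d)
    sim = z @ z.T / temperature                    # scaled similarity matrix
    n = len(z1)
    np.fill_diagonal(sim, -np.inf)                 # exclude self-similarity
    # the positive for index i is its counterpart in the other view
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    logits = sim - sim.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), pos].mean()
```

Minimizing this loss pulls same-utterance segment embeddings together and pushes different-utterance embeddings apart; the channel-invariant loss would be an additional term computed on the same embeddings.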
