Large-Scale Self-Supervised Speech Representation Learning for Automatic Speaker Verification

Speech representations learned from large-scale unlabeled data have shown better generalizability than those learned with supervision and have therefore attracted considerable interest for a variety of downstream tasks. In this paper, we explore the limits of speech representations learned with different self-supervised objectives and datasets for automatic speaker verification (ASV), using the well-recognized state-of-the-art ASV model ECAPA-TDNN [6] as the downstream model. The representations from all hidden layers of the pre-trained model are first averaged with learnable weights and then fed into the ECAPA-TDNN as input features. Experimental results on the VoxCeleb dataset show that the weighted-average representation is significantly superior to FBank, a conventional handcrafted feature for ASV. Our best single system achieves 0.537%, 0.569%, and 1.180% equal error rate (EER) on the three official trials of VoxCeleb1, respectively. Furthermore, an ensemble of three pre-trained models improves the EERs to 0.479%, 0.536%, and 1.023%. Among the three evaluation trials, our best system outperforms the winner system [37] of the VoxCeleb Speaker Recognition Challenge 2021 (VoxSRC2021) on the VoxCeleb1-E trial.
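
To make the feature-extraction step concrete, here is a minimal PyTorch sketch of the learnable weighted average over hidden layers described above. It assumes a wav2vec 2.0 / HuBERT-style encoder that exposes all of its hidden states; the class and variable names (and the commented usage) are our own illustration, not the authors' code.

```python
import torch
import torch.nn as nn


class WeightedLayerSum(nn.Module):
    """Softmax-normalized learnable average over the hidden layers of a
    (typically frozen) pre-trained speech encoder."""

    def __init__(self, num_layers: int):
        super().__init__()
        # One scalar weight per hidden layer, learned jointly with the
        # downstream ASV model.
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (num_layers, batch, time, feat_dim)
        w = torch.softmax(self.layer_weights, dim=0)
        # Broadcast the per-layer weights and collapse the layer axis.
        return (w.view(-1, 1, 1, 1) * hidden_states).sum(dim=0)


# Hypothetical usage (names are assumptions): stack all hidden states of
# the pre-trained encoder, average them, and feed the result to
# ECAPA-TDNN in place of FBank features.
#   hidden = torch.stack(pretrained_model(wav).hidden_states)
#   features = WeightedLayerSum(hidden.size(0))(hidden)
#   embedding = ecapa_tdnn(features)
```

Normalizing the weights with a softmax keeps the combination a convex average, so the downstream ECAPA-TDNN can learn which layers carry the most speaker information without re-scaling the input feature magnitudes.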

References

[1] Alexei Baevski et al., "wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations," NeurIPS, 2020.

[2] Meng Li et al., "Exploring wav2vec 2.0 on Speaker Verification and Language Identification," Interspeech, 2020.

[3] Kris Demuynck et al., "The IDLab VoxSRC-20 Submission: Large Margin Fine-Tuning and Quality-Aware Score Calibration in DNN Based Speaker Verification," ICASSP, 2021.

[4] Daniel Povey et al., "Self-Attentive Speaker Embeddings for Text-Independent Speaker Verification," Interspeech, 2018.

[5] Ruslan Salakhutdinov et al., "HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units," IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021.

[6] Kris Demuynck et al., "ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification," Interspeech, 2020.

[7] Furu Wei et al., "UniSpeech-SAT: Universal Speech Representation Learning with Speaker Aware Pre-Training," ICASSP, 2022.

[8] Hye-jin Shim et al., "RawNet: Advanced End-to-End Deep Neural Network Using Raw Waveforms for Text-Independent Speaker Verification," Interspeech, 2019.

[9] Nicholas W. D. Evans et al., "Constant Q Cepstral Coefficients: A Spoofing Countermeasure for Automatic Speaker Verification," Computer Speech & Language, 2017.

[10] Nicholas W. D. Evans et al., "Articulation Rate Filtering of CQCC Features for Automatic Speaker Verification," Interspeech, 2016.

[11] Ya Zhang et al., "Deep Feature for Text-Dependent Speaker Verification," Speech Communication, 2015.

[12] Enhua Wu et al., "Squeeze-and-Excitation Networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.

[13] Stefanos Zafeiriou et al., "ArcFace: Additive Angular Margin Loss for Deep Face Recognition," CVPR, 2019.

[14] Sanjeev Khudanpur et al., "X-Vectors: Robust DNN Embeddings for Speaker Recognition," ICASSP, 2018.

[15] Alec Radford et al., "Improving Language Understanding by Generative Pre-Training," 2018.

[16] Yoshua Bengio et al., "Speaker Recognition from Raw Waveform with SincNet," IEEE SLT, 2018.

[17] Joon Son Chung et al., "VoxCeleb2: Deep Speaker Recognition," Interspeech, 2018.

[18] Pietro Laface et al., "Comparison of Speaker Recognition Approaches for Real Applications," Interspeech, 2011.

[19] Shuai Wang et al., "Margin Matters: Towards More Discriminative Deep Neural Network Embeddings for Speaker Recognition," APSIPA ASC, 2019.

[20] Ming-Wei Chang et al., "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," NAACL, 2019.

[21] Shuai Wang et al., "BUT System Description to VoxCeleb Speaker Recognition Challenge 2019," arXiv, 2019.

[22] Chunlei Zhang et al., "End-to-End Text-Independent Speaker Verification with Triplet Loss on Short Utterances," Interspeech, 2017.

[23] Shinji Watanabe et al., "SUPERB: Speech Processing Universal PERformance Benchmark," Interspeech, 2021.

[24] Karen Livescu et al., "Layer-wise Analysis of a Self-Supervised Speech Representation Model," arXiv, 2021.

[25] Quan Wang et al., "Generalized End-to-End Loss for Speaker Verification," ICASSP, 2018.

[26] Chunlei Zhang et al., "Self-Supervised Text-Independent Speaker Verification Using Prototypical Momentum Contrastive Learning," ICASSP, 2021.

[27] Yuexian Zou et al., "Contrastive Self-Supervised Learning for Text-Independent Speaker Verification," ICASSP, 2021.

[28] Joon Son Chung et al., "VoxCeleb: A Large-Scale Speaker Identification Dataset," Interspeech, 2017.

[29] William M. Campbell et al., "Towards Reduced False-Alarms Using Cohorts," ICASSP, 2011.

[30] Weiqing Wang et al., "An Iterative Framework for Self-Supervised Deep Speaker Representation Learning," ICASSP, 2021.

[31] Benoît Sagot et al., "What Does BERT Learn about the Structure of Language?" ACL, 2019.

[32] Koichi Shinoda et al., "Attentive Statistics Pooling for Deep Speaker Embedding," Interspeech, 2018.

[33] Kai Zhao et al., "Res2Net: A New Multi-Scale Backbone Architecture," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.

[34] Luciana Ferrer et al., "Emotion Recognition from Speech Using wav2vec 2.0 Embeddings," Interspeech, 2021.

[35] Daniel Povey et al., "MUSAN: A Music, Speech, and Noise Corpus," arXiv, 2015.

[36] Lukasz Kaiser et al., "Attention Is All You Need," NIPS, 2017.

[37] Minqiang Xu et al., "The SpeakIn System for VoxCeleb Speaker Recognition Challenge 2021," arXiv, 2021.