Momentum Contrast Speaker Representation Learning

Unsupervised representation learning has shown remarkable achievement by reducing the performance gap with supervised feature learning, especially in the image domain. In this study, to extend the technique of unsupervised learning to the speech domain, we propose the Momentum Contrast for VoxCeleb (MoCoVox) as a form of learning mechanism. We pre-trained the MoCoVox on the VoxCeleb1 by implementing instance discrimination. Applying MoCoVox for speaker verification revealed that it outperforms the state-of-the-art metric learning-based approach by a large margin. We also empirically demonstrate the features of contrastive learning in the speech domain by analyzing the distribution of learned representations. Furthermore, we explored which pretext task is adequate for speaker verification. We expect that learning speaker representation without human supervision helps to address the open-set speaker recognition.

[1]  Stella X. Yu,et al.  Unsupervised Feature Learning via Non-parametric Instance Discrimination , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[2]  Daniel Povey,et al.  MUSAN: A Music, Speech, and Noise Corpus , 2015, ArXiv.

[3]  Stefanos Zafeiriou,et al.  ArcFace: Additive Angular Margin Loss for Deep Face Recognition , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Oriol Vinyals,et al.  Representation Learning with Contrastive Predictive Coding , 2018, ArXiv.

[5]  Geoffrey E. Hinton,et al.  Distilling the Knowledge in a Neural Network , 2015, ArXiv.

[6]  Yann LeCun,et al.  Dimensionality Reduction by Learning an Invariant Mapping , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[7]  Kaiming He,et al.  Momentum Contrast for Unsupervised Visual Representation Learning , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  Joon Son Chung,et al.  VoxCeleb2: Deep Speaker Recognition , 2018, INTERSPEECH.

[9]  Geoffrey E. Hinton,et al.  Phoneme recognition using time-delay neural networks , 1989, IEEE Trans. Acoust. Speech Signal Process..

[10]  Joon Son Chung,et al.  In defence of metric learning for speaker recognition , 2020, INTERSPEECH.

[11]  Geoffrey E. Hinton,et al.  A Simple Framework for Contrastive Learning of Visual Representations , 2020, ICML.

[12]  Joon Son Chung,et al.  Disentangled Speech Embeddings Using Cross-Modal Self-Supervision , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[13]  Xing Ji,et al.  CosFace: Large Margin Cosine Loss for Deep Face Recognition , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[14]  Natalia Gimelshein,et al.  PyTorch: An Imperative Style, High-Performance Deep Learning Library , 2019, NeurIPS.

[15]  Joon Son Chung,et al.  VoxCeleb: A Large-Scale Speaker Identification Dataset , 2017, INTERSPEECH.

[16]  Phillip Isola,et al.  Contrastive Multiview Coding , 2019, ECCV.

[17]  Richard S. Zemel,et al.  Prototypical Networks for Few-shot Learning , 2017, NIPS.

[18]  Bhiksha Raj,et al.  SphereFace: Deep Hypersphere Embedding for Face Recognition , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  R Devon Hjelm,et al.  Learning Representations by Maximizing Mutual Information Across Views , 2019, NeurIPS.

[20]  Kevin Wilson,et al.  Looking to listen at the cocktail party , 2018, ACM Trans. Graph..

[21]  Alexei A. Efros,et al.  Colorful Image Colorization , 2016, ECCV.

[22]  Chris Donahue,et al.  Exploring Speech Enhancement with Generative Adversarial Networks for Robust Speech Recognition , 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[23]  Nakamasa Inoue,et al.  Semi-Supervised Contrastive Learning with Generalized Contrastive Loss and Its Application to Speaker Recognition , 2020, 2020 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC).

[24]  Hao Tang,et al.  VoiceID Loss: Speech Enhancement for Speaker Verification , 2019, INTERSPEECH.

[25]  Shinji Watanabe,et al.  Augmentation adversarial training for unsupervised speaker recognition , 2020, ArXiv.