In defence of metric learning for speaker recognition

The objective of this paper is 'open-set' speaker recognition of unseen speakers, where ideal embeddings should be able to condense information into a compact utterance-level representation that has small intra-speaker and large inter-speaker distance. A popular belief in speaker recognition is that networks trained with classification objectives outperform metric learning methods. In this paper, we present an extensive evaluation of the most popular loss functions for speaker recognition on the VoxCeleb dataset. We demonstrate that the vanilla triplet loss shows competitive performance compared to classification-based losses, and that networks trained with our proposed metric learning objective outperform state-of-the-art methods.
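
To make the "small intra-speaker, large inter-speaker distance" objective concrete, below is a minimal PyTorch sketch of the vanilla triplet loss mentioned in the abstract. It is an illustrative implementation under the assumption of L2-normalised embeddings and squared Euclidean distance; the function and variable names (`triplet_loss`, `margin`) are ours and not taken from the authors' code.

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Vanilla triplet loss on L2-normalised speaker embeddings.

    Pushes the anchor-positive (same-speaker) distance to be smaller than
    the anchor-negative (different-speaker) distance by at least `margin`.
    """
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negative = F.normalize(negative, dim=-1)

    d_ap = (anchor - positive).pow(2).sum(dim=-1)  # intra-speaker distance
    d_an = (anchor - negative).pow(2).sum(dim=-1)  # inter-speaker distance
    return F.relu(d_ap - d_an + margin).mean()

# Usage with dummy utterance-level embeddings of shape (batch, dim):
emb_a, emb_p, emb_n = (torch.randn(32, 512) for _ in range(3))
loss = triplet_loss(emb_a, emb_p, emb_n)
```

The margin value and distance function here are illustrative defaults; in practice they are hyperparameters tuned per system.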
