Partial AUC Optimization Based Deep Speaker Embeddings with Class-Center Learning for Text-Independent Speaker Verification

Deep embedding based text-independent speaker verification has demonstrated superior performance to traditional methods in many challenging scenarios. Its loss functions generally fall into two classes: verification losses and identification losses. Verification loss functions match the evaluation pipeline of speaker verification, but they are difficult to implement and optimize. Consequently, most state-of-the-art deep embedding methods use identification loss functions with softmax output units or their variants. In this paper, we propose a verification loss function, the maximization of the partial area under the receiver operating characteristic (ROC) curve (pAUC), for deep embedding based text-independent speaker verification. We also propose a class-center based training trial construction method to improve training efficiency, which is critical for making the proposed loss function competitive with identification losses. Experiments on the Speakers in the Wild (SITW) and NIST SRE 2016 datasets show that the proposed pAUC loss function is highly competitive with state-of-the-art identification loss functions.
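To illustrate the idea (this is a minimal sketch under assumed hyper-parameters, not the authors' exact formulation), the PyTorch snippet below shows one way a pAUC surrogate can be computed from a batch of target and impostor similarity scores: only the impostor scores whose ranks fall inside a chosen false-positive-rate interval contribute, and a squared hinge penalizes target scores that do not exceed them by a margin. The `fpr_range` and `margin` values are illustrative assumptions.

```python
import torch

def pauc_hinge_loss(pos_scores, neg_scores, fpr_range=(0.0, 0.01), margin=0.1):
    """Squared-hinge surrogate for partial AUC over an FPR interval (sketch).

    pos_scores: similarity scores of target (same-speaker) trials, shape (P,)
    neg_scores: similarity scores of non-target (impostor) trials, shape (N,)
    """
    n = neg_scores.numel()
    lo = int(fpr_range[0] * n)
    hi = max(lo + 1, int(fpr_range[1] * n))
    # The highest-scoring impostors correspond to the low-FPR region of the ROC curve.
    hard_negs, _ = torch.topk(neg_scores, hi)
    hard_negs = hard_negs[lo:hi]
    # Pairwise squared hinge: penalize target scores that fail to exceed
    # the selected impostor scores by at least the margin.
    diffs = pos_scores.unsqueeze(1) - hard_negs.unsqueeze(0)  # shape (P, hi - lo)
    return torch.clamp(margin - diffs, min=0.0).pow(2).mean()
```

In a training loop following the class-center idea described above, `pos_scores` and `neg_scores` would plausibly be cosine similarities between mini-batch embeddings and learned class centers of the same and of different speakers, respectively; this pairing scheme is an assumption for illustration.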
