Boundary Discriminative Large Margin Cosine Loss for Text-independent Speaker Verification

Deep neural network based speaker embeddings have attracted much attention in the text-independent speaker verification task. In addition to the network architecture, an appropriate design of the loss function is crucial for a deep discriminative embedding extractor. Inspired by the success of Large Margin Cosine Loss (LMCL) in face recognition, we propose an enhanced LMCL, named boundary discriminative LMCL (BD-LMCL), to emphasize the discriminative information inherent in the speaker boundaries. Unlike LMCL, where all training samples contribute equally to the objective function, BD-LMCL considers only the samples around the speaker boundaries during network training. Specifically, the samples close to the boundaries are dynamically selected using the top-k zero-one loss. Experimental results on the short-duration Android Cellphone corpus and on NIST SRE 2012 show that BD-LMCL outperforms LMCL and other popular loss functions.
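The abstract does not spell out the objective, so the following is only a minimal sketch of how the selection idea could look, not the authors' implementation. It assumes the standard LMCL form (cosine logits with a margin `m` subtracted from the target class, scaled by `s`) and assumes that a sample counts as "near the boundary" when its margin-penalized target logit is not among the `k` largest logits, i.e. when its top-k zero-one loss is 1. The function names, the selection rule, and the values `s=30.0`, `m=0.35`, `k=1` are illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def lmcl_logits(embedding, class_weights, label, s=30.0, m=0.35):
    """Scaled cosine logits with the margin m subtracted from the target class."""
    logits = []
    for j, w in enumerate(class_weights):
        c = cosine(embedding, w)
        if j == label:
            c -= m  # large-margin penalty on the target class only
        logits.append(s * c)
    return logits


def cross_entropy(logits, label):
    """Numerically stable softmax cross-entropy for one sample."""
    mx = max(logits)
    log_sum = mx + math.log(sum(math.exp(z - mx) for z in logits))
    return log_sum - logits[label]


def topk_zero_one(logits, label, k=1):
    """Return 1 if the target logit is NOT among the k largest logits, else 0."""
    higher = sum(1 for j, z in enumerate(logits)
                 if j != label and z >= logits[label])
    return 1 if higher >= k else 0


def bd_lmcl_sketch(embeddings, labels, class_weights, s=30.0, m=0.35, k=1):
    """Average LMCL over the samples selected by the top-k zero-one loss.

    ASSUMPTION: the selection rule is a guess at "samples around the
    boundaries"; the source abstract does not define it precisely.
    Returns (mean loss over selected samples, number of selected samples).
    """
    total, selected = 0.0, 0
    for x, y in zip(embeddings, labels):
        logits = lmcl_logits(x, class_weights, y, s, m)
        if topk_zero_one(logits, y, k):
            total += cross_entropy(logits, y)
            selected += 1
    return (total / selected if selected else 0.0), selected


if __name__ == "__main__":
    weights = [[1.0, 0.0], [0.0, 1.0]]       # two hypothetical speaker classes
    batch = [[1.0, 0.0], [1.0, 0.8]]         # second sample lies near the boundary
    loss, n_selected = bd_lmcl_sketch(batch, [0, 0], weights)
    print(n_selected, loss)
```

In this toy batch, the first embedding sits well inside its class and is skipped, while the second falls inside the margin band and is the only one that contributes to the loss. A real system would compute this on mini-batches inside an automatic-differentiation framework rather than with plain Python lists.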
