Cosine metric learning based speaker verification

Abstract: The performance of speaker verification depends on the overlap region between the decision scores of true and impostor trials. Because this overlap can be reduced by maximizing the between-class distance while minimizing the within-class variance of the trial scores, this paper presents two cosine metric learning (CML) back-end algorithms. The first, named m-CML, aims to enlarge the between-class distance, with a regularization term that controls the within-class variance. The second, named v-CML, aims to reduce the within-class variance, with a regularization term that prevents the between-class distance from shrinking. The regularization terms in both CML methods can be initialized by a traditional channel compensation method, e.g., linear discriminant analysis. The two algorithms are combined with front-end processing for speaker verification. To validate their effectiveness, m-CML is combined with an i-vector front-end, since it is good at enlarging the between-class distance of Gaussian score distributions, while v-CML is combined with a d-vector or x-vector front-end, since it can significantly reduce the within-class variance of heavy-tailed score distributions. Experimental results on the NIST and SITW speaker recognition evaluation corpora show that the proposed algorithms outperform the channel compensation methods used to initialize them and are competitive with the probabilistic linear discriminant analysis back-end. For comparison, we also apply the m-CML and v-CML methods to the i-vector and x-vector front-ends.
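The two quantities the abstract refers to can be illustrated with a minimal sketch. It assumes a linear transform A applied before cosine scoring and measures the between-class distance as the gap between the mean true-trial and mean impostor-trial scores, and the within-class variance as the sum of the two score variances. The function names, the toy data, and these exact definitions are assumptions made for illustration; they are not the paper's m-CML or v-CML objectives.

```python
import numpy as np

def cosine_score(A, x, y):
    """Cosine similarity between x and y after applying the linear map A."""
    ax, ay = A @ x, A @ y
    return float(ax @ ay / (np.linalg.norm(ax) * np.linalg.norm(ay)))

def score_statistics(A, trials, labels):
    """Return (between-class distance, within-class variance) of trial scores.

    trials: list of (x, y) vector pairs; labels: 1 = true trial, 0 = impostor.
    Both statistics are illustrative definitions, not the paper's objectives.
    """
    scores = np.array([cosine_score(A, x, y) for x, y in trials])
    is_true = np.asarray(labels, dtype=bool)
    true_scores, imp_scores = scores[is_true], scores[~is_true]
    between = true_scores.mean() - imp_scores.mean()
    within = true_scores.var() + imp_scores.var()
    return between, within

# Toy data (hypothetical): two speakers, each utterance is a noisy copy
# of its speaker's mean vector.
rng = np.random.default_rng(0)
dim = 8
speaker_means = rng.normal(size=(2, dim))

def utterance(speaker):
    return speaker_means[speaker] + 0.5 * rng.normal(size=dim)

trials = [
    (utterance(0), utterance(0)),  # true trial
    (utterance(1), utterance(1)),  # true trial
    (utterance(0), utterance(1)),  # impostor trial
    (utterance(1), utterance(0)),  # impostor trial
]
labels = [1, 1, 0, 0]

# Identity transform = plain cosine scoring with no learned metric.
between, within = score_statistics(np.eye(dim), trials, labels)
print(f"between-class distance: {between:.3f}, within-class variance: {within:.3f}")
```

In this framing, a learned A that increases the first value or decreases the second would shrink the overlap between the true and impostor score distributions; how A is actually optimized and regularized is specified in the paper itself, not here.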
