Cosine Metric Learning for Speaker Verification in the I-vector Space

It is known that the equal error rate (EER) of a speaker verification system is determined by the overlap between the decision-score distributions of true and impostor trials. Moreover, the cosine similarity scores of the true trials and of the impostor trials produced by a state-of-the-art i-vector front-end are each approximately Gaussian, and the overlap between the two classes of trials depends mainly on their between-class distance. Motivated by these facts, this paper presents a cosine metric learning (CML) framework for speaker verification, which combines classical compensation techniques with cosine similarity scoring to improve EER performance. CML minimizes the overlap region by enlarging the between-class distance, while a regularization term controls the within-class variance. The framework is initialized by a traditional channel compensation technique such as linear discriminant analysis. Experiments on the NIST speaker recognition evaluation data sets compare the proposed CML framework with several traditional channel compensation baselines. The results show that CML outperforms all of the compensation techniques studied as initializations.
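The motivation above can be illustrated with a small numerical sketch. It is not the paper's CML algorithm: it only shows cosine scoring with an optional linear projection and how the EER shrinks as the between-class distance of two Gaussian score distributions grows. The function names, the projection matrix `A`, and the synthetic means and variances are all assumptions chosen for illustration.

```python
import numpy as np

def cosine_score(x, y, A=None):
    """Cosine similarity of two vectors, optionally after a linear
    projection A (a hypothetical stand-in for a learned compensation matrix)."""
    if A is not None:
        x, y = A @ x, A @ y
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def equal_error_rate(true_scores, impostor_scores):
    """Approximate the EER by sweeping the threshold over all observed scores."""
    true_scores = np.asarray(true_scores, dtype=float)
    impostor_scores = np.asarray(impostor_scores, dtype=float)
    thresholds = np.sort(np.concatenate([true_scores, impostor_scores]))
    # False rejection: a true trial scores below the threshold.
    frr = np.array([(true_scores < t).mean() for t in thresholds])
    # False acceptance: an impostor trial scores at or above the threshold.
    far = np.array([(impostor_scores >= t).mean() for t in thresholds])
    i = np.argmin(np.abs(far - frr))  # point where the two error rates meet
    return float((far[i] + frr[i]) / 2)

# Synthetic Gaussian scores (assumed values, not results from the paper).
rng = np.random.default_rng(0)
impostor = rng.normal(0.0, 0.15, 2000)
true_close = rng.normal(0.3, 0.15, 2000)  # small between-class distance
true_far = rng.normal(0.6, 0.15, 2000)    # larger between-class distance

eer_close = equal_error_rate(true_close, impostor)
eer_far = equal_error_rate(true_far, impostor)
print(f"EER, small distance: {eer_close:.3f}")
print(f"EER, large distance: {eer_far:.3f}")
```

With the within-class variance held fixed, the run with the larger between-class distance yields the lower EER, which is the effect CML aims for when it enlarges that distance and regularizes the variance.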
