Cross-lingual speaker verification with deep feature learning

Existing speaker verification (SV) systems often suffer from performance degradation if there is any language mismatch between model training, speaker enrollment, and test. A major cause of this degradation is that most existing SV methods rely on a probabilistic model to infer the speaker factor, so any significant change on the distribution of the speech signal will impact the inference. Recently, we proposed a deep learning model that can learn how to extract the speaker factor by a deep neural network (DNN). By this feature learning, an SV system can be constructed with a very simple back-end model. In this paper, we investigate the robustness of the feature-based SV system in situations with language mismatch. Our experiments were conducted on a complex cross-lingual scenario, where the model training was in English, and the enrollment and test were in Chinese or Uyghur. The experiments demonstrated that the feature-based system outperformed the i-vector system with a large margin, particularly with language mismatch between enrollment and test.

[1]  Florin Curelaru,et al.  Front-End Factor Analysis For Speaker Verification , 2018, 2018 International Conference on Communications (COMM).

[2]  Haizhou Li,et al.  An overview of text-independent speaker recognition: From features to supervectors , 2010, Speech Commun..

[3]  Patrick Kenny,et al.  Joint Factor Analysis Versus Eigenchannels in Speaker Recognition , 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[4]  Erik McDermott,et al.  Deep neural networks for small footprint text-dependent speaker verification , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[5]  Daniel Povey,et al.  The Kaldi Speech Recognition Toolkit , 2011 .

[6]  John H. L. Hansen,et al.  Spoken language mismatch in speaker verification: An investigation with NIST-SRE and CRSS Bi-Ling corpora , 2014, 2014 IEEE Spoken Language Technology Workshop (SLT).

[7]  Thomas Fang Zheng,et al.  Cross-lingual speaker verification based on linear transform , 2015, 2015 IEEE China Summit and International Conference on Signal and Information Processing (ChinaSIP).

[8]  Douglas A. Reynolds,et al.  Speaker Verification Using Adapted Gaussian Mixture Models , 2000, Digit. Signal Process..

[9]  Jr. J.P. Campbell,et al.  Speaker recognition: a tutorial , 1997, Proc. IEEE.

[10]  Roland Auckenthaler,et al.  Language dependency in text-independent speaker verification , 2001, 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221).

[11]  Yifan Gong,et al.  End-to-End attention based text-dependent speaker verification , 2016, 2016 IEEE Spoken Language Technology Workshop (SLT).

[12]  Michael T. Johnson,et al.  Vocal source features for bilingual speaker identification , 2013, 2013 IEEE China Summit and International Conference on Signal and Information Processing.

[13]  Liang Lu,et al.  The effect of language factors for robust speaker recognition , 2009, 2009 IEEE International Conference on Acoustics, Speech and Signal Processing.

[14]  Xiao Liu,et al.  Deep Speaker: an End-to-End Neural Speaker Embedding System , 2017, ArXiv.

[15]  Thomas Fang Zheng,et al.  Language-aware PLDA for multilingual speaker recognition , 2016, 2016 Conference of The Oriental Chapter of International Committee for Coordination and Standardization of Speech Databases and Assessment Techniques (O-COCOSDA).

[16]  John H. L. Hansen,et al.  Speaker Recognition by Machines and Humans: A tutorial review , 2015, IEEE Signal Processing Magazine.

[17]  John H. L. Hansen,et al.  Language Normalization for Bilingual Speaker Recognition Systems , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[18]  Douglas A. Reynolds,et al.  An overview of automatic speaker recognition technology , 2002, 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[19]  Sergey Ioffe,et al.  Probabilistic Linear Discriminant Analysis , 2006, ECCV.

[20]  Sanjeev Khudanpur,et al.  Parallel training of DNNs with Natural Gradient and Parameter Averaging , 2014 .

[21]  Sanjeev Khudanpur,et al.  Deep neural network-based speaker embeddings for end-to-end speaker verification , 2016, 2016 IEEE Spoken Language Technology Workshop (SLT).

[22]  Dong Wang,et al.  Deep Speaker Feature Learning for Text-Independent Speaker Verification , 2017, INTERSPEECH.

[23]  Bin Ma,et al.  English-Chinese bilingual text-independent speaker verification , 2004, 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing.