Cross-lingual speaker verification based on linear transform

Speaker verification suffers serious performance degradation when the enrollment and test speech are in different languages. This degradation can be largely attributed to the differing distributions of acoustic features across languages. This paper proposes a linear transform approach that projects speech signals from one language to another, so that the language mismatch between enrollment and test can be mitigated. Constrained maximum likelihood linear regression (CMLLR) is adopted to conduct the linear transform in the feature domain. The proposed approach was evaluated on a Chinese-Uyghur cross-lingual speaker verification task. We collected a bilingual speech database, CSLT-CUDGT2014, which consists of 113 female speakers who speak both Standard Chinese and Uyghur. On this database, the proposed linear transform achieved a relative improvement of about 10% in equal error rate (EER).
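At application time, a CMLLR transform reduces to an affine map y = Ax + b applied to each acoustic feature vector. The sketch below (a minimal illustration, not the paper's implementation) shows only this feature-domain application step; the function name and the toy transform values are hypothetical, and the EM-based estimation of A and b from Gaussian statistics is omitted.

```python
import numpy as np

def apply_cmllr_transform(features, A, b):
    """Apply a CMLLR-style affine feature transform y = A x + b.

    features: (T, D) array of acoustic feature vectors (e.g. MFCCs)
    A:        (D, D) transform matrix (assumed already estimated)
    b:        (D,)   bias vector
    Returns a (T, D) array of transformed features.
    """
    return features @ A.T + b

# Toy example with 2-dimensional features and a hypothetical transform.
T, D = 5, 2
feats = np.arange(T * D, dtype=float).reshape(T, D)
A = np.eye(D) * 0.5           # placeholder: a real A comes from CMLLR training
b = np.array([1.0, -1.0])     # placeholder bias
projected = apply_cmllr_transform(feats, A, b)
```

In the cross-lingual setting described above, such a transform would be estimated so that features from one language are mapped toward the acoustic distribution of the other before scoring.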
