Paper information - System combination for short utterance speaker recognition

The performance of text-independent speaker recognition often degrades dramatically when utterances are short. This paper presents a combination approach to short-utterance speaker recognition (SUSR) that uses two phonetic-aware systems: a DNN-based i-vector system and our recently proposed subregion-based GMM-UBM system. The former uses phone posteriors to construct an i-vector model whose shared statistics offer stronger robustness against limited test data, while the latter builds a phone-dependent GMM-UBM system that represents speaker characteristics in more detail. Score-level fusion integrates the respective advantages of the two systems. Experimental results show that, on the text-independent SUSR task, both the DNN-based i-vector system and the subregion-based GMM-UBM system outperform their respective baselines, and that the score-level system combination delivers a performance improvement.
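As a rough illustration of what score-level fusion can look like, the sketch below combines the trial scores of two systems with a weighted sum after normalizing each system's scores. The abstract does not state the fusion rule, so everything here is an assumption made for illustration: the function names `znorm` and `fuse_scores`, the weight `alpha`, and the choice of zero-mean, unit-variance normalization.

```python
import numpy as np


def znorm(scores):
    """Shift scores to zero mean and scale to unit variance (assumed normalization)."""
    scores = np.asarray(scores, dtype=float)
    centered = scores - scores.mean()
    std = scores.std()
    # Avoid division by zero when all scores are identical.
    return centered / std if std > 0 else centered


def fuse_scores(ivector_scores, gmm_ubm_scores, alpha=0.5):
    """Weighted sum of two systems' normalized scores for the same trials.

    alpha is a hypothetical fusion weight in [0, 1]; it would normally be
    tuned on a development set, not fixed in advance.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    a = znorm(ivector_scores)
    b = znorm(gmm_ubm_scores)
    if a.shape != b.shape:
        raise ValueError("both systems must score the same trials")
    return alpha * a + (1.0 - alpha) * b


# Illustrative scores for three trials; the values are made up.
fused = fuse_scores([1.0, 2.0, 3.0], [10.0, 20.0, 30.0], alpha=0.5)
```

Normalizing before summing keeps one system's score range from dominating the other, which matters when one system emits log-likelihood ratios and the other a differently scaled back-end score.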

Thomas Fang Zheng | Xiaodong Zhang | Lantian Li | Dong Wang | Panshi Jin
