The effect of language factors for robust speaker recognition

Results from recent NIST speaker recognition evaluations show that speaker recognition systems developed mainly on English training data suffer from a language gap: performance on non-English trials is much worse than on English trials. This paper addresses that problem. Building on the conventional joint factor analysis model, we introduce language factors, which are meant to capture the language characteristics of each training and test utterance, and we compensate by removing these factors in order to shrink the difference between languages. Experiments on the 2006 NIST SRE data show that language factor compensation alone reduces the gap between English and non-English trial performance, and that score-level combination with eigenchannels further improves performance on non-English trials. For example, on the female part we observed a relative reduction in EER of about 19% compared with eigenchannel session variability compensation alone.
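The idea of removing language factors can be illustrated with a minimal sketch. It assumes an utterance is represented by a supervector and that a low-rank language subspace matrix `L` has already been trained; it estimates the language factors by plain least squares and subtracts their contribution. This is a simplification for illustration only: a full joint factor analysis system would estimate the factors from Baum-Welch statistics with covariance weighting, and the names `remove_language_factors`, `L`, and `mean` are hypothetical, not taken from the paper.

```python
import numpy as np


def remove_language_factors(supervector, mean, L):
    """Subtract the part of (supervector - mean) that lies in span(L).

    L is an assumed (dim x n_factors) language subspace matrix.
    Returns the compensated supervector and the estimated factors z.
    """
    centered = supervector - mean
    # Least-squares point estimate of the language factors z
    # (a stand-in for the posterior estimate used in factor analysis).
    z, *_ = np.linalg.lstsq(L, centered, rcond=None)
    return supervector - L @ z, z


# Toy data: all values below are synthetic, for illustration only.
rng = np.random.default_rng(0)
dim, n_factors = 12, 2
mean = rng.normal(size=dim)
L = rng.normal(size=(dim, n_factors))
speaker_offset = rng.normal(size=dim)
utterance = mean + speaker_offset + L @ np.array([1.5, -0.7])

compensated, z_hat = remove_language_factors(utterance, mean, L)
residual = compensated - mean  # what is left after compensation
```

After compensation the residual has no component along the columns of `L`. Note that any speaker information that happens to lie in the same subspace is removed as well, which is one reason a real system estimates the subspaces jointly.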
