Emotional adaptive training for speaker verification

Speaker verification suffers significant performance degradation under emotion variation. In a previous study, we demonstrated that an adaptation approach based on MLLR/CMLLR can deliver a significant performance improvement for verification on emotional speech. This paper follows that direction and presents an emotional adaptive training (EAT) approach, which iteratively estimates emotion-dependent CMLLR transformations and re-trains the speaker models on the transformed speech, thereby making use of emotional enrollment speech to train stronger speaker models. This is analogous to speaker adaptive training (SAT) in speech recognition. The experiments are conducted on an emotional speech database that contains recordings of 30 speakers in 5 emotions. The results demonstrate that the EAT approach yields significant performance improvements over the baseline system, in which the speaker models are trained on neutral enrollment data and the emotional test utterances are verified directly. The EAT approach also significantly outperforms two other emotion-adaptation approaches: (1) the CMLLR-based approach, where the speaker models are trained on neutral enrollment speech and the emotional test utterances are transformed by CMLLR at verification time; and (2) the MAP-based approach, where the emotional enrollment data are used to train emotion-dependent speaker models and the emotional test utterances are verified against the emotion-matched models.
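For illustration, the sketch below implements the alternating estimation described above under simplifying assumptions: diagonal-covariance GMM speaker models (fit with scikit-learn, rather than the UBM/MAP framework typically used in speaker verification) and a simplified diagonal CMLLR transform x_hat = a * x + b, for which the maximum-likelihood estimate has a closed form per feature dimension. The function names, model sizes, and iteration counts are hypothetical, and full CMLLR uses an unconstrained affine matrix estimated row by row; this is a minimal sketch of the EAT idea, not the authors' implementation.

```python
# Minimal EAT sketch (assumption: diagonal CMLLR + sklearn diagonal GMMs,
# standing in for the full-matrix CMLLR and speaker models of the paper).
import numpy as np
from sklearn.mixture import GaussianMixture


def estimate_diag_cmllr(gmm, X):
    """ML estimate of a diagonal CMLLR transform (a, b) so that the
    transformed features a * X + b best fit the current speaker GMM.

    Per dimension the auxiliary function is
        beta * log(a) - 0.5 * sum_{t,m} gamma_tm * (a*x_t + b - mu_m)^2 / var_m,
    which reduces to a quadratic in `a` after eliminating `b`.
    """
    gamma = gmm.predict_proba(X)            # (T, M) component posteriors
    mu, var = gmm.means_, gmm.covariances_  # (M, D), diagonal covariances
    occ = gamma.sum(axis=0)                 # (M,) component occupancies
    beta = gamma.sum()                      # total frame count

    gx = gamma.T @ X                        # (M, D) first-order statistics
    gx2 = gamma.T @ X**2                    # (M, D) second-order statistics
    s = (occ[:, None] / var).sum(axis=0)    # all of the following are (D,)
    p = (gx / var).sum(axis=0)
    q = (occ[:, None] * mu / var).sum(axis=0)
    G = (gx2 / var).sum(axis=0)
    k = (mu / var * gx).sum(axis=0)

    # Eliminate b = (q - a*p)/s, then solve c1*a^2 - c2*a - beta = 0, a > 0.
    c1 = G - p**2 / s
    c2 = k - p * q / s
    a = (c2 + np.sqrt(c2**2 + 4.0 * c1 * beta)) / (2.0 * c1)
    b = (q - a * p) / s
    return a, b


def emotional_adaptive_training(neutral_X, emotional_X, n_components=16, n_iters=5):
    """Alternate between (1) estimating one CMLLR transform per emotion and
    (2) re-training the speaker GMM on the pooled, transformed enrollment data.

    neutral_X: (T, D) neutral enrollment features.
    emotional_X: dict mapping emotion label -> (T_e, D) enrollment features.
    """
    def fit(X):
        return GaussianMixture(n_components, covariance_type="diag",
                               reg_covar=1e-4, random_state=0).fit(X)

    gmm = fit(neutral_X)                    # initial model: neutral speech only
    transforms = {}
    for _ in range(n_iters):
        pooled = [neutral_X]
        for emo, X in emotional_X.items():
            a, b = estimate_diag_cmllr(gmm, X)   # emotion-dependent transform
            transforms[emo] = (a, b)
            pooled.append(a * X + b)             # map toward canonical space
        gmm = fit(np.vstack(pooled))             # re-train the speaker model
    return gmm, transforms
```

At verification time, a test utterance of emotion e would first be mapped with the corresponding transform before scoring against the canonical speaker model; refitting the GMM from scratch at each iteration is a simplification, where warm-starting from the previous parameters would be closer to the iterative re-estimation the paper describes.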
