Learning Virtual HD Model for Bi-model Emotional Speaker Recognition

Pitch mismatch between training and testing is one of the major factors causing performance degradation in speaker recognition systems. In this paper, we adopt the missing feature theory and define the Unreliable Region (UR) as the parts of an utterance with high emotion-induced pitch variation. To model these regions, a virtual HD (High Different from neutral, i.e., with a large pitch offset) model for each target speaker was built from virtual speech, which was converted from neutral speech by the Pitch Transformation Algorithm (PTA). In the PTA, a polynomial transformation function is learned to model the relationship between the average pitch of neutral and high-pitched utterances. Compared with the traditional GMM-UBM and our previous method, the new method obtained identification rate (IR) increases of 1.88% and 0.84%, respectively, on the MASC corpus, which are promising results.
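The core of the PTA described above is a polynomial function mapping a speaker's neutral average pitch to the average pitch expected under high-arousal emotion. A minimal sketch of that idea, with illustrative pitch values and a simple mean-offset contour shift (both assumptions, not figures or procedures taken from the paper):

```python
import numpy as np

# Hypothetical paired training data: average pitch (Hz) of neutral
# utterances and of the corresponding high-pitched emotional utterances.
# These numbers are illustrative only, not from the paper.
neutral_f0 = np.array([110.0, 125.0, 140.0, 160.0, 185.0, 210.0])
high_f0 = np.array([150.0, 170.0, 188.0, 215.0, 245.0, 275.0])

def learn_pitch_transform(src, dst, degree=2):
    """Fit a polynomial mapping neutral average pitch to high-pitched
    average pitch, in the spirit of a PTA-style transformation."""
    coeffs = np.polyfit(src, dst, degree)
    return np.poly1d(coeffs)

def transform_pitch_contour(f0_contour, transform):
    """Shift a neutral F0 contour so its voiced-frame mean matches the
    predicted high-pitched mean, preserving the contour's shape.
    (A plain offset; the paper's conversion may differ.)"""
    f0 = np.asarray(f0_contour, dtype=float)
    voiced = f0 > 0  # unvoiced frames are conventionally marked 0
    offset = transform(f0[voiced].mean()) - f0[voiced].mean()
    out = f0.copy()
    out[voiced] += offset
    return out

pta = learn_pitch_transform(neutral_f0, high_f0)
virtual = transform_pitch_contour([0.0, 120.0, 130.0, 125.0, 0.0], pta)
```

The "virtual" contour produced this way could then drive speech conversion so that virtual HD training data is generated for each target speaker without recording emotional speech.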

[1] Yingchun Yang et al., "Applying pitch-dependent difference detection and modification to emotional speaker recognition," INTERSPEECH, 2008.

[2] Andrzej Drygajlo et al., "Speaker verification in noisy environments with combined spectral subtraction and missing feature theory," Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98, 1998.

[3] Brian Kingsbury et al., "Pseudo pitch synchronous analysis of speech with applications to speaker recognition," IEEE Transactions on Audio, Speech, and Language Processing, 2006.

[4] D. G. Childers et al., "Measuring and modeling vocal source-tract interaction," IEEE Transactions on Biomedical Engineering, 1994.

[5] Samuel Kim et al., "A pitch synchronous feature extraction method for speaker recognition," 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2004.

[6] Douglas A. Reynolds et al., "The NIST speaker recognition evaluation - overview, methodology, systems, results, perspective," Speech Communication, 2000.

[7] Klaus R. Scherer et al., "Can automatic speaker verification be improved by training the algorithms on emotional speech?," INTERSPEECH, 2000.

[8] Zhaohui Wu et al., "MASC: A speech corpus in Mandarin for emotion analysis and affective speaker recognition," 2006 IEEE Odyssey - The Speaker and Language Recognition Workshop, 2006.

[9] H. Akaike, "A new look at the statistical model identification," 1974.

[10] Douglas A. Reynolds et al., "On the influence of rate, pitch, and spectrum on automatic speaker recognition performance," INTERSPEECH, 2000.

[11] Douglas A. Reynolds et al., "Speaker verification using adapted Gaussian mixture models," Digital Signal Processing, 2000.