Weaknesses of voice biometrics - sensitivity of speaker verification to emotional arousal

In this series of experiments, we study weaknesses of voice biometric systems and look for ways to improve their robustness. The acoustic features that represent human voices in current automatic speaker verification systems change significantly when the speaker’s emotional arousal deviates from the neutral state. The speech templates used to enroll a speaker are generally recorded in a neutral emotional state with "normal" vocal effort. Speaking with higher or lower voice tension therefore causes a mismatch between training and testing conditions, which results in more verification errors. The acoustic cues of increased emotional arousal in speech are highly non-specific: they resemble those of Lombard speech, warning and insisting voices, emergency voices, extreme acute stress, shouting, and emotions such as anger, fear, and hate, among many others. Because the available spontaneous emotional speech databases neither cover the full range of emotional arousal for individual voices nor contain enough utterances per speaker, we used our acted CRISIS database, which contains speech utterances at six levels of tense emotional arousal per speaker. We validated the sensitivity of a state-of-the-art i-vector-based speaker recognizer with PLDA scoring to arousal mismatch. The speaker verification system was successfully implemented in the online “Speaker authorization” module developed within the European project Global ATM Security Management (GAMMA). We observed that verification reliability decreases at extreme arousal levels. Enrolling speakers with a mix of utterances at various arousal levels produced more robust models and showed a promising improvement in verification reliability over the baseline.
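The mixed-enrollment idea can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the study: cosine scoring stands in for PLDA scoring, speaker embeddings (for example i-vectors) are assumed to be already extracted, and the function names, the threshold, and the two-dimensional toy vectors are hypothetical and unrelated to the CRISIS data.

```python
import numpy as np


def length_normalize(x):
    """Scale each embedding (last axis) to unit Euclidean length."""
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def enroll(embeddings):
    """Build one speaker model by averaging length-normalized embeddings.

    Passing embeddings from several arousal levels gives a "mixed"
    enrollment; passing only neutral ones gives a neutral-only model.
    """
    model = length_normalize(embeddings).mean(axis=0)
    return length_normalize(model)


def cosine_score(model, test_embedding):
    """Cosine similarity between a speaker model and a test embedding."""
    return float(np.dot(model, length_normalize(test_embedding)))


def verify(model, test_embedding, threshold=0.5):
    """Accept the identity claim if the score reaches the threshold.

    The threshold value is an arbitrary placeholder for illustration.
    """
    return cosine_score(model, test_embedding) >= threshold


# Toy vectors only: one direction stands for neutral speech,
# the orthogonal one for high-arousal speech of the same speaker.
neutral = [1.0, 0.0]
high_arousal = [0.0, 1.0]

neutral_model = enroll([neutral])
mixed_model = enroll([neutral, high_arousal])

high_arousal_test = [0.0, 2.0]
print(cosine_score(neutral_model, high_arousal_test))
print(cosine_score(mixed_model, high_arousal_test))
```

In this toy setup the mixed model scores the high-arousal test vector higher than the neutral-only model does, which mirrors the intuition behind mixed enrollment. It says nothing about the size of the effect in a real i-vector/PLDA system.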
