Timing levels in segment-based speech emotion recognition

Additional sub-phrase-level information is believed to improve accuracy in speech emotion recognition systems. Yet, automatic segmentation is a challenge in its own right when word or syllable boundaries are considered. Furthermore, clarification is needed as to which timing level leads to optimal results. In this paper we therefore quantitatively discuss three approaches to segment-level features based on 276 statistical high-level prosodic, articulatory, and speech quality features. Apart from the choice of the optimal segmentation scheme, we also analyze the fusion of segments with respect to classification and the combination of diverse timing levels. Tests are carried out on the popular Berlin Database of Emotional Speech (EMO-DB). A significant improvement over existing works can be reported for the combination of phrase-level features with relative-time-interval features.
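To make the idea of relative-time-interval segment features concrete, the following is a minimal Python sketch: it splits a frame-level feature contour (here, a toy pitch contour) into k equal relative time intervals, applies a handful of statistical functionals per segment, and concatenates the result with phrase-level functionals computed over the whole utterance. The functionals shown (mean, standard deviation, min, max, range) and the choice k = 3 are illustrative assumptions only; the paper's actual set comprises 276 high-level features and is not reproduced here.

import numpy as np

def functionals(contour: np.ndarray) -> np.ndarray:
    """A few illustrative statistical functionals over a frame-level contour
    (stand-ins for the paper's 276 high-level features)."""
    return np.array([
        contour.mean(),
        contour.std(),
        contour.min(),
        contour.max(),
        contour.max() - contour.min(),  # range
    ])

def relative_interval_features(contour: np.ndarray, k: int = 3) -> np.ndarray:
    """Split the utterance into k relative time intervals and apply the
    functionals to each segment (segment-level features)."""
    segments = np.array_split(contour, k)
    return np.concatenate([functionals(seg) for seg in segments])

def combined_features(contour: np.ndarray, k: int = 3) -> np.ndarray:
    """Combine phrase-level functionals (whole utterance) with
    relative-time-interval features."""
    return np.concatenate([functionals(contour),
                           relative_interval_features(contour, k)])

# Usage on a synthetic pitch contour (one F0 value per frame):
f0 = np.abs(np.sin(np.linspace(0, 3 * np.pi, 120))) * 120 + 100
x = combined_features(f0, k=3)  # 5 phrase-level + 3 * 5 segment features
print(x.shape)                  # (20,)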
