Multimodal emotion recognition in audiovisual communication

This paper discusses novel techniques to automatically estimate a user's emotional state by analyzing the speech signal and the haptic interaction on a touch-screen or via mouse. Knowledge of the user's emotion permits adaptive dialogue strategies that strive for a more natural and robust interaction. We classify seven emotional states: surprise, joy, anger, fear, disgust, sadness, and a neutral user state. The user's emotion is extracted by a parallel stochastic analysis of spoken and haptic machine interactions while the desired intention is recognized. The introduced methods are based on the common prosodic speech features pitch and energy, but also rely on semantic and intention-based features: wording, degree of verbosity, temporal intention and word rate, and finally the history of user utterances. As a further modality, touch-screen or mouse interaction is analyzed. The estimates based on these features are integrated multimodally. The introduced methods build on results of user studies, and a realization proved reliable when compared with the subjective impressions of test participants.
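The abstract gives no implementation details, so the following Python sketch is purely illustrative: it shows one plausible way to extract the prosodic features named above (frame-level pitch and energy contours, summarized as utterance-level statistics) and to combine per-modality class posteriors over the seven emotional states by weighted late fusion. All function names, the voicing threshold, the fusion weights, and the example posteriors are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

EMOTIONS = ["surprise", "joy", "anger", "fear", "disgust", "sadness", "neutral"]

def frame_signal(x, frame_len=400, hop=160):
    """Split a mono signal into overlapping frames (25 ms / 10 ms at 16 kHz)."""
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])

def energy_contour(frames):
    """Short-time log energy per frame."""
    return np.log(np.sum(frames ** 2, axis=1) + 1e-10)

def pitch_contour(frames, sr=16000, fmin=60.0, fmax=400.0):
    """Crude autocorrelation-based F0 estimate per frame (0 for unvoiced)."""
    f0 = np.zeros(len(frames))
    lo, hi = int(sr / fmax), int(sr / fmin)
    for i, frame in enumerate(frames):
        frame = frame - frame.mean()
        ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
        if ac[0] <= 0:
            continue
        lag = lo + np.argmax(ac[lo:hi])
        if ac[lag] / ac[0] > 0.3:  # voicing threshold: an assumed value
            f0[i] = sr / lag
    return f0

def prosodic_features(x, sr=16000):
    """Utterance-level statistics of the pitch and energy contours."""
    frames = frame_signal(x)
    e, f0 = energy_contour(frames), pitch_contour(frames, sr)
    voiced = f0[f0 > 0]
    return np.array([
        e.mean(), e.std(), e.max() - e.min(),
        voiced.mean() if voiced.size else 0.0,
        voiced.std() if voiced.size else 0.0,
        (voiced.max() - voiced.min()) if voiced.size else 0.0,
    ])

def fuse(posteriors, weights):
    """Weighted late fusion of per-modality posteriors over the seven classes."""
    logp = sum(w * np.log(p + 1e-10) for p, w in zip(posteriors, weights))
    p = np.exp(logp - logp.max())
    return p / p.sum()

# Demo: a synthetic 0.5 s, 200 Hz tone stands in for a speech segment.
sr = 16000
t = np.arange(int(0.5 * sr)) / sr
x = 0.5 * np.sin(2 * np.pi * 200.0 * t)
print(prosodic_features(x, sr).round(2))

# Hypothetical per-modality posteriors (speech, haptic interaction) and weights.
p_speech = np.array([0.10, 0.05, 0.45, 0.10, 0.05, 0.05, 0.20])
p_haptic = np.array([0.10, 0.10, 0.30, 0.15, 0.10, 0.05, 0.20])
p = fuse([p_speech, p_haptic], weights=[0.7, 0.3])
print(EMOTIONS[int(np.argmax(p))], p.round(3))
```

A weighted product of posteriors is only one possible fusion scheme; the paper's own stochastic integration of speech, semantic, and haptic estimates may differ in both features and combination rule.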
