Data-driven clustering in emotional space for affect recognition using discriminatively trained LSTM networks

In today’s affective databases, speech turns are often labelled on a continuous scale for emotional dimensions such as valence or arousal, to better express the diversity of human affect. However, applications like virtual agents usually map the detected emotional user state to coarse classes in order to reduce the multiplicity of emotion-dependent system responses. Since these classes often do not optimally reflect the emotions that typically occur in a given application, this paper investigates data-driven clustering of emotional space to find class divisions that better match the training data and the area of application. To this end, we consider the Belfast Sensitive Artificial Listener database and TV talkshow data from the VAM corpus. We show that a discriminatively trained Long Short-Term Memory (LSTM) recurrent neural network that explicitly learns clusters in emotional space and additionally models context information outperforms both Support Vector Machines and a Regression-LSTM network.

Index Terms: emotion recognition, affective databases, long short-term memory
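The data-driven class division described in the abstract can be illustrated with a minimal k-means sketch over two-dimensional valence–arousal annotations. This is only a hedged illustration of the general idea; the paper's actual clustering procedure, cluster count, and annotation values are not specified here, and the sample points below are hypothetical:

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Cluster 2-D (valence, arousal) points with plain k-means."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialise centres from the data
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # assignment step: each point goes to its nearest centre
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: (p[0] - centers[c][0]) ** 2
                                + (p[1] - centers[c][1]) ** 2)
            clusters[i].append(p)
        # update step: move each centre to the mean of its cluster
        centers = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

# hypothetical valence/arousal labels in [-1, 1], not taken from the corpora
points = [(-0.8, 0.7), (-0.7, 0.6),   # e.g. angry region
          (0.6, 0.5), (0.7, 0.6),     # e.g. happy region
          (0.1, -0.8), (0.0, -0.7)]   # e.g. bored region
centers, clusters = kmeans(points, k=3)
```

The resulting cluster indices would then serve as discrete class targets for a discriminatively trained classifier, replacing a hand-picked division of the valence–arousal plane.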
