On-line emotion recognition in a 3-D activation-valence-time continuum using acoustic and linguistic cues

For many applications of emotion recognition, such as virtual agents, the system must select responses while the user is speaking. This requires reliable on-line recognition of the user's affect. However, most emotion recognition systems are based on turn-wise processing. We present a novel approach to on-line emotion recognition from speech using Long Short-Term Memory Recurrent Neural Networks. Emotion is recognised frame-wise in a two-dimensional valence-activation continuum. In contrast to current state-of-the-art approaches, recognition is performed on low-level signal frames, similar to those used for speech recognition. No statistical functionals are applied to the low-level feature contours, so framing at a higher level is unnecessary and regression outputs can be produced in real time for every low-level input frame. We also investigate the benefits of including linguistic features, obtained by a keyword spotter, at the signal-frame level.
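The core idea above, a recurrent network that emits one continuous (activation, valence) estimate per low-level signal frame rather than one label per turn, can be illustrated with a minimal sketch. This is not the authors' implementation: the weights here are random rather than trained (the paper's networks are trained LSTMs), the feature dimensionality and frame count are hypothetical, and a single unidirectional LSTM cell stands in for the full architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 39 low-level features per frame (MFCC-style),
# a small hidden layer, and 2 regression outputs (activation, valence).
n_in, n_hidden, n_out = 39, 8, 2

# Randomly initialised weights; a real system would learn these by
# gradient training on emotion-annotated speech.
W = rng.standard_normal((4 * n_hidden, n_in + n_hidden)) * 0.1
b = np.zeros(4 * n_hidden)
W_out = rng.standard_normal((n_out, n_hidden)) * 0.1
b_out = np.zeros(n_out)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c):
    """One LSTM step: gates computed from the current frame x and previous state h."""
    z = W @ np.concatenate([x, h]) + b
    i, f, o, g = np.split(z, 4)                  # input, forget, output gates + candidate
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)  # update cell state
    h = sigmoid(o) * np.tanh(c)                   # new hidden state
    return h, c

def framewise_regression(frames):
    """Emit one (activation, valence) estimate per low-level input frame."""
    h = np.zeros(n_hidden)
    c = np.zeros(n_hidden)
    outputs = []
    for x in frames:                # frames arrive one at a time -> on-line operation
        h, c = lstm_step(x, h, c)
        outputs.append(np.tanh(W_out @ h + b_out))  # bounded to [-1, 1] per dimension
    return np.array(outputs)

frames = rng.standard_normal((100, n_in))  # 100 hypothetical low-level frames
preds = framewise_regression(frames)
print(preds.shape)  # one 2-D emotion estimate per frame
```

Because each output depends only on the current and past frames, estimates are available immediately as frames arrive, which is what makes turn-level segmentation and statistical functionals unnecessary in this setting.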
