LSTM-Modeling of continuous emotions in an audiovisual affect recognition framework

Automatically recognizing human emotions from spontaneous, non-prototypical real-life data is currently one of the most challenging tasks in the field of affective computing. This article presents our recent advances in assessing dimensional representations of emotion, such as arousal, expectation, power, and valence, in an audiovisual human-computer interaction scenario. Building on previous studies that demonstrate how long-range context modeling tends to increase emotion recognition accuracy, we propose a fully automatic audiovisual recognition approach based on Long Short-Term Memory (LSTM) modeling of word-level audio and video features. LSTM networks can incorporate knowledge about how emotions typically evolve over time, so that the inferred emotion estimates take an optimal amount of context into account. Extensive evaluations on the Audiovisual Sub-Challenge of the 2011 Audio/Visual Emotion Challenge show how acoustic, linguistic, and visual features contribute to the recognition of the different affective dimensions annotated in the SEMAINE database. We apply the same acoustic features as the challenge baseline system, whereas visual features are computed via a novel facial movement feature extractor. Comparing our results with the recognition scores of all Audiovisual Sub-Challenge participants, we find that the proposed LSTM-based technique leads to the best average recognition performance reported for this task so far.
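To make the core idea concrete, the sketch below shows how an LSTM cell carries context across a sequence of word-level features to produce one continuous emotion estimate per word. This is an illustrative toy, not the authors' implementation: it uses a single scalar-valued cell, hypothetical weight names, and untrained weights, purely to show how the forget, input, and output gates let the cell state accumulate long-range context.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_cell_step(x, h_prev, c_prev, w):
    """One LSTM step for a scalar input; w is a dict of scalar weights."""
    f = sigmoid(w["wf"] * x + w["uf"] * h_prev + w["bf"])  # forget gate
    i = sigmoid(w["wi"] * x + w["ui"] * h_prev + w["bi"])  # input gate
    o = sigmoid(w["wo"] * x + w["uo"] * h_prev + w["bo"])  # output gate
    g = math.tanh(w["wg"] * x + w["ug"] * h_prev + w["bg"])  # candidate update
    c = f * c_prev + i * g        # cell state: long-range context memory
    h = o * math.tanh(c)          # context-aware output for this word
    return h, c

def predict_sequence(features, w):
    """Map word-level features to per-word continuous emotion estimates."""
    h, c = 0.0, 0.0
    outputs = []
    for x in features:
        h, c = lstm_cell_step(x, h, c, w)
        outputs.append(h)         # one regression output per word
    return outputs
```

In the actual system such cells are grouped into trained (bidirectional) LSTM layers fed with multidimensional acoustic, linguistic, and visual feature vectors; the recurrence above is what lets each per-word estimate depend on the preceding emotional context rather than on the current word alone.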
