Learning Hierarchical Emotion Context for Continuous Dimensional Emotion Recognition From Video Sequences

Dimensional emotion recognition is currently one of the most challenging tasks in affective computing. In this paper, a novel three-stage method is proposed to learn hierarchical emotion context information (feature- and label-level contexts) for predicting affective dimension values from video sequences. In the first stage, a feed-forward neural network generates a high-level representation of the raw input features. In the second stage, bidirectional long short-term memory (BLSTM) layers learn the context information of the feature sequences from this high-level representation and produce initial recognition results. Finally, in the third stage, a BLSTM neural network learns context information from emotion label sequences in an unsupervised way; this label-level context is used to correct the initial recognition results and obtain the final predictions. We also explore the influence of different sequence lengths by sampling from the original sequences. Experiments on the video data of AVEC 2015 demonstrate the effectiveness of the proposed method. Our framework highlights that incorporating both feature- and label-level dependencies and context information is a promising research direction for predicting continuous dimensional emotions.
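
The abstract describes the three stages only at a high level; the following minimal PyTorch sketch shows one plausible way to wire them together. All class names, layer sizes, and the feature dimension are illustrative assumptions, and the residual correction in stage 3 is a simplification: the paper trains the label-context BLSTM in an unsupervised way on label sequences, which is not reproduced here.

    import torch
    import torch.nn as nn

    class HierarchicalEmotionContext(nn.Module):
        def __init__(self, feat_dim=84, hidden_dim=128, num_dims=2):
            super().__init__()
            # Stage 1: feed-forward network lifts raw frame features
            # to a high-level representation.
            self.encoder = nn.Sequential(
                nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
                nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            )
            # Stage 2: BLSTM learns feature-level context and emits
            # initial per-frame predictions (e.g. arousal and valence).
            self.feature_blstm = nn.LSTM(hidden_dim, hidden_dim,
                                         batch_first=True, bidirectional=True)
            self.initial_head = nn.Linear(2 * hidden_dim, num_dims)
            # Stage 3: BLSTM over the initial label sequence captures
            # label-level context and refines the predictions
            # (here as a learned residual correction; an assumption).
            self.label_blstm = nn.LSTM(num_dims, hidden_dim,
                                       batch_first=True, bidirectional=True)
            self.refine_head = nn.Linear(2 * hidden_dim, num_dims)

        def forward(self, x):                       # x: (batch, time, feat_dim)
            h = self.encoder(x)                     # stage 1
            ctx, _ = self.feature_blstm(h)          # stage 2: feature context
            initial = self.initial_head(ctx)        # initial recognition results
            label_ctx, _ = self.label_blstm(initial)  # stage 3: label context
            final = initial + self.refine_head(label_ctx)
            return initial, final

    # Usage on dummy data: 8 clips, 300 frames each, 84-dim features per frame.
    model = HierarchicalEmotionContext()
    frames = torch.randn(8, 300, 84)
    initial, final = model(frames)   # both (8, 300, 2): per-frame dimension values

Keeping both outputs mirrors the paper's structure: the stage-2 head can be supervised against the ground-truth annotations directly, while the stage-3 module only sees label sequences, so it can in principle be trained separately from the feature pathway.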
