Continuous affect recognition with weakly supervised learning

Recognizing a person’s affective state from audio-visual signals is an essential capability for intelligent interaction. Insufficient training data and unreliable labels of the affective dimensions (e.g., valence and arousal) are two major challenges in continuous affect recognition. In this paper, we propose a weakly supervised learning approach based on a hybrid of a deep neural network and a bidirectional long short-term memory recurrent neural network (DNN-BLSTM). It first maps the audio/visual features into a more discriminative space using the modelling capacity of the DNN, then models the temporal dynamics of affect with the BLSTM. To reduce the negative impact of unreliable labels, we use a temporal label (TL) together with a robust loss function (RL) to incorporate weak supervision into the learning process of the DNN-BLSTM model. As a result, the proposed method not only has a simpler structure than the deep BLSTM model in He et al. (24), which requires more training data, but is also robust to noisy and unreliable labels. Single-modal and multimodal affect recognition experiments have been carried out on the RECOLA dataset. The single-modal results show that the proposed method with TL and RL obtains marked improvements on both arousal and valence in terms of the concordance correlation coefficient (CCC), while the multimodal results show that, with fewer feature streams, the proposed approach obtains results better than or comparable to those of state-of-the-art methods.
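The evaluation metric named above, the concordance correlation coefficient, can be illustrated with a short sketch. The code below uses the standard definition, CCC = 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²), with population (1/N) statistics. The function name `concordance_cc`, the use of plain Python lists, and the handling of the degenerate constant-sequence case are assumptions made for illustration; this is not the evaluation code used in the experiments.

```python
from statistics import fmean


def concordance_cc(predictions, labels):
    """Concordance correlation coefficient of two equal-length sequences.

    Hypothetical helper for illustration; uses population (1/N) statistics.
    """
    if len(predictions) != len(labels) or len(predictions) == 0:
        raise ValueError("sequences must be non-empty and of equal length")

    mean_p = fmean(predictions)
    mean_l = fmean(labels)

    # Population variances and covariance.
    var_p = fmean([(p - mean_p) ** 2 for p in predictions])
    var_l = fmean([(l - mean_l) ** 2 for l in labels])
    cov = fmean([(p - mean_p) * (l - mean_l)
                 for p, l in zip(predictions, labels)])

    denominator = var_p + var_l + (mean_p - mean_l) ** 2
    if denominator == 0.0:
        # Both sequences are the same constant; treating this as perfect
        # agreement is a convention assumed here, not taken from the paper.
        return 1.0
    return 2.0 * cov / denominator


if __name__ == "__main__":
    gold = [1.0, 2.0, 3.0]
    print(concordance_cc(gold, gold))             # identical traces
    print(concordance_cc([2.0, 3.0, 4.0], gold))  # same shape, constant offset
```

Unlike the Pearson correlation, the CCC penalizes a constant offset or a scale mismatch between prediction and label, which is why a shifted copy of the gold trace scores below 1 in the second call.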

[1]  Ya Li,et al.  Long Short Term Memory Recurrent Neural Network based Multimodal Dimensional Emotion Recognition , 2015, AVEC@ACM Multimedia.

[2]  P. J. Werbos Backpropagation through time: what it does and how to do it , 1990 .

[3]  Qin Jin,et al.  Multi-modal Dimensional Emotion Recognition using Recurrent Neural Networks , 2015, AVEC@ACM Multimedia.

[4]  Kilian Q. Weinberger,et al.  Densely Connected Convolutional Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Björn W. Schuller,et al.  Introducing CURRENNT: the munich open-source CUDA recurrent neural network toolkit , 2015, J. Mach. Learn. Res..

[6]  Mohamed Chetouani,et al.  Robust continuous prediction of human emotions using multiscale dynamic cues , 2012, ICMI '12.

[7]  William M. Campbell,et al.  Multi-Modal Audio, Video and Physiological Sensor Learning for Continuous Emotion Prediction , 2016, AVEC@ACM Multimedia.

[8]  Björn W. Schuller,et al.  AVEC 2014: 3D Dimensional Affect and Depression Recognition Challenge , 2014, AVEC '14.

[9]  Удк,et al.  Unmasking the Face: A Guide to Recognizing Emotions from Facial Clues , 2018 .

[10]  Jürgen Schmidhuber,et al.  Framewise phoneme classification with bidirectional LSTM and other neural network architectures , 2005, Neural Networks.

[11]  Fabien Ringeval,et al.  Facing Realism in Spontaneous Emotion Recognition from Speech: Feature Enhancement by Autoencoder with LSTM Neural Networks , 2016, INTERSPEECH.

[12]  Laurens van der Maaten Audio-visual emotion challenge 2012: a simple approach , 2012, ICMI '12.

[13]  Maja Pantic,et al.  The first facial expression recognition and analysis challenge , 2011, Face and Gesture 2011.

[14]  Josef Kittler,et al.  Floating search methods in feature selection , 1994, Pattern Recognit. Lett..

[15]  Wolfgang Minker,et al.  Emotion Recognition and Depression Diagnosis by Acoustic and Visual Features: A Multimodal Approach , 2014, AVEC '14.

[16]  Ronald J. Williams,et al.  Gradient-based learning algorithms for recurrent networks and their computational complexity , 1995 .

[17]  Emily Mower Provost,et al.  Discretized Continuous Speech Emotion Recognition with Multi-Task Deep Recurrent Neural Network , 2017, INTERSPEECH.

[18]  Navdeep Jaitly,et al.  Hybrid speech recognition with Deep Bidirectional LSTM , 2013, 2013 IEEE Workshop on Automatic Speech Recognition and Understanding.

[19]  Christine L. Lisetti Affective computing , 1998, Pattern Analysis and Applications.

[20]  Sang Hyun Park,et al.  Facial expression recognition based on local region specific features and support vector machines , 2016, Multimedia Tools and Applications.

[21]  Eun-Soo Kim,et al.  Human facial expression recognition using curvelet feature extraction and normalized mutual information feature selection , 2014, Multimedia Tools and Applications.

[22]  Fabien Ringeval,et al.  Introducing the RECOLA multimodal corpus of remote collaborative and affective interactions , 2013, 2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG).

[23]  Fabien Ringeval,et al.  AVEC 2015: The 5th International Audio/Visual Emotion Challenge and Workshop , 2015, ACM Multimedia.

[24]  P. M. Prenter Splines and variational methods , 1975 .

[25]  Dongmei Jiang,et al.  Multimodal Affective Dimension Prediction Using Deep Bidirectional Long Short-Term Memory Recurrent Neural Networks , 2015, AVEC@ACM Multimedia.

[26]  Hatice Gunes,et al.  Continuous Prediction of Spontaneous Affect from Multiple Cues and Modalities in Valence-Arousal Space , 2011, IEEE Transactions on Affective Computing.

[27]  Hatice Gunes,et al.  Automatic Segmentation of Spontaneous Data using Dimensional Labels from Multiple Coders , 2010 .

[28]  Jürgen Schmidhuber,et al.  Learning to forget: continual prediction with LSTM , 1999 .

[29]  Ya Li,et al.  Multi-scale Temporal Modeling for Dimensional Emotion Recognition in Video , 2014, AVEC '14.

[30]  Christopher M. Bishop,et al.  Neural networks for pattern recognition , 1995 .

[31]  Ze-Nian Li,et al.  Recognition of facial expressions based on salient geometric features and support vector machines , 2016, Multimedia Tools and Applications.

[32]  David G. Stork,et al.  Pattern Classification , 1973 .

[33]  Zheng Zhang,et al.  FERA 2017 - Addressing Head Pose in the Third Facial Expression Recognition and Analysis Challenge , 2017, 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017).

[34]  Fabien Ringeval,et al.  Summary for AVEC 2016: Depression, Mood, and Emotion Recognition Workshop and Challenge , 2016, ACM Multimedia.

[35]  Louis-Philippe Morency,et al.  Step-wise emotion recognition using concatenated-HMM , 2012, ICMI '12.

[36]  Tamás D. Gedeon,et al.  Video and Image based Emotion Recognition Challenges in the Wild: EmotiW 2015 , 2015, ICMI.

[37]  Bo Sun,et al.  Exploring Multimodal Visual Features for Continuous Affect Recognition , 2016, AVEC@ACM Multimedia.

[38]  Fabien Ringeval,et al.  AV+EC 2015: The First Affect Recognition Challenge Bridging Across Audio, Video, and Physiological Data , 2015, AVEC@ACM Multimedia.

[39]  George Trigeorgis,et al.  Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network , 2016, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[40]  Fabien Ringeval,et al.  Reconstruction-error-based learning for continuous emotion recognition in speech , 2017, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[41]  Fabien Ringeval,et al.  AVEC 2016: Depression, Mood, and Emotion Recognition Workshop and Challenge , 2016, AVEC@ACM Multimedia.

[42]  Dongmei Jiang,et al.  Relevance units machine based dimensional and continuous speech emotion prediction , 2014, Multimedia Tools and Applications.

[43]  Björn W. Schuller,et al.  AVEC 2012: the continuous audio/visual emotion challenge , 2012, ICMI '12.

[44]  J. Russell A circumplex model of affect. , 1980 .

[45]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[46]  Jesse Hoey,et al.  EmotiW 2016: video and group-level emotion recognition challenges , 2016, ICMI.

[47]  Shizhe Chen,et al.  Multimodal Multi-task Learning for Dimensional and Continuous Emotion Recognition , 2017, AVEC@ACM Multimedia.

[48]  Björn W. Schuller,et al.  Combining Long Short-Term Memory and Dynamic Bayesian Networks for Incremental Emotion-Sensitive Artificial Listening , 2010, IEEE Journal of Selected Topics in Signal Processing.

[49]  Iñaki Inza,et al.  Weak supervision and other non-standard classification problems: A taxonomy , 2016, Pattern Recognit. Lett..

[50]  Pavel Matejka,et al.  Multimodal Emotion Recognition for AVEC 2016 Challenge , 2016, AVEC@ACM Multimedia.

[51]  Rahul Gupta,et al.  Online Affect Tracking with Multimodal Kalman Filters , 2016, AVEC@ACM Multimedia.

[52]  Fabio Valente,et al.  The INTERSPEECH 2013 computational paralinguistics challenge: social signals, conflict, emotion, autism , 2013, INTERSPEECH.

[53]  Björn W. Schuller,et al.  Feature enhancement by deep LSTM networks for ASR in reverberant multisource environments , 2014, Comput. Speech Lang..

[54]  Carsten Rother,et al.  Weakly supervised discriminative localization and classification: a joint learning process , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[55]  Nasrollah Moghaddam Charkari,et al.  Multimodal information fusion application to human emotion recognition from face and speech , 2010, Multimedia Tools and Applications.

[56]  Jesse Hoey,et al.  From individual to group-level emotion recognition: EmotiW 5.0 , 2017, ICMI.

[57]  Geoffrey E. Hinton,et al.  Speech recognition with deep recurrent neural networks , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[58]  Fabien Ringeval,et al.  Discriminatively Trained Recurrent Neural Networks for Continuous Dimensional Emotion Recognition from Audio , 2016, IJCAI.

[59]  Björn W. Schuller,et al.  Abandoning emotion classes - towards continuous emotion recognition with modelling of long-range dependencies , 2008, INTERSPEECH.

[60]  Lijun Yin,et al.  FERA 2015 - second Facial Expression Recognition and Analysis challenge , 2015, 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG).

[61]  Albert Ali Salah,et al.  Ensemble CCA for Continuous Emotion Prediction , 2014, AVEC '14.

[62]  Peter Robinson,et al.  Dimensional affect recognition using Continuous Conditional Random Fields , 2013, 2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG).

[63]  Jürgen Schmidhuber,et al.  2005 Special Issue , 2005 .

[64]  Fabien Ringeval,et al.  AVEC 2017: Real-life Depression, and Affect Recognition Workshop and Challenge , 2017, AVEC@ACM Multimedia.

[65]  Constantine Kotropoulos,et al.  Fast sequential floating forward selection applied to emotional speech features estimated on DES and SUSAS data collections , 2006, 2006 14th European Signal Processing Conference.

[66]  Alex Graves,et al.  Supervised Sequence Labelling with Recurrent Neural Networks , 2012, Studies in Computational Intelligence.

[67]  Uma Shanker Tiwary,et al.  Affect representation and recognition in 3D continuous valence–arousal–dominance space , 2016, Multimedia Tools and Applications.

[68]  Zafer Aydin,et al.  BAUM-2: a multilingual audio-visual affective face database , 2014, Multimedia Tools and Applications.

[69]  Björn W. Schuller,et al.  AVEC 2013: the continuous audio/visual emotion and depression recognition challenge , 2013, AVEC@ACM Multimedia.

[70]  Thomas Fillon,et al.  YAAFE, an Easy to Use and Efficient Audio Feature Extraction Software , 2010, ISMIR.

[71]  Björn W. Schuller,et al.  LSTM-Modeling of continuous emotions in an audiovisual affect recognition framework , 2013, Image Vis. Comput..

[72]  Dongmei Jiang,et al.  Multimodal dimensional affect recognition using deep bidirectional long short-term memory recurrent neural networks , 2015, 2015 International Conference on Affective Computing and Intelligent Interaction (ACII).