Prediction of asynchronous dimensional emotion ratings from audiovisual and physiological data

We study the relevance of context learning for handling the asynchrony of annotations. We unite audiovisual and physiological data for continuous affect analysis. We propose multi-time-resolution feature extraction from multimodal data. The use of context learning makes it possible to account for the reaction-time delay of raters. Fusion of audiovisual and physiological data performs best on arousal and valence.

Automatic emotion recognition systems based on supervised machine learning require reliable annotation of affective behaviours to build useful models. Whereas the dimensional approach is becoming more and more popular for rating affective behaviours in continuous time domains, e.g., arousal and valence, methodologies that take into account the reaction lag of human raters are still rare. We therefore investigate the relevance of machine learning algorithms that can integrate contextual information into the modelling, as long short-term memory (LSTM) recurrent neural networks do, to automatically predict emotion from several (asynchronous) raters in continuous time domains, i.e., arousal and valence. Evaluations are performed on the recently proposed RECOLA multimodal database (27 subjects, 5 min of data and six raters for each), which includes audio, video and physiological (ECG, EDA) data; studies uniting audiovisual and physiological information are indeed still very rare. Features are extracted with various window sizes for each modality, and performance on automatic emotion prediction is compared across different neural network architectures and fusion approaches (feature-level vs. decision-level). The results show that: (i) LSTM networks can deal with the (asynchronous) dependencies found between continuous emotion ratings and video data, (ii) predicting valence requires a longer analysis window than predicting arousal, and (iii) decision-level fusion leads to better performance than feature-level fusion. The best performance for multimodal emotion prediction, measured with the concordance correlation coefficient, is 0.804 for arousal and 0.528 for valence.
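
Performance here is reported as the concordance correlation coefficient (CCC), Lin's reproducibility measure [4], which, unlike Pearson's correlation, penalises differences in both scale and offset between the predicted trace and the gold standard. The sketch below shows one way to compute it; the function name and the use of NumPy are illustrative assumptions, not part of the original work.

```python
import numpy as np

def concordance_correlation_coefficient(gold, pred):
    """Lin's concordance correlation coefficient between two 1-D traces."""
    gold = np.asarray(gold, dtype=float)
    pred = np.asarray(pred, dtype=float)
    mean_g, mean_p = gold.mean(), pred.mean()
    var_g, var_p = gold.var(), pred.var()                # population variances
    cov_gp = np.mean((gold - mean_g) * (pred - mean_p))
    # CCC = 2*cov / (var_g + var_p + (mean_g - mean_p)^2)
    return 2.0 * cov_gp / (var_g + var_p + (mean_g - mean_p) ** 2)

# A prediction that tracks the gold standard but with a constant offset keeps
# a Pearson correlation of 1.0 yet receives a CCC below 1.0.
t = np.linspace(0, 2 * np.pi, 500)
print(concordance_correlation_coefficient(np.sin(t), np.sin(t) + 0.3))
```

The abstract gives no architectural detail beyond the use of LSTM recurrent networks [51] to integrate context. Purely as an illustration of how such a context-aware regressor over frame-level features can be set up, a minimal PyTorch sketch follows; the layer sizes, two-layer depth and joint arousal/valence output head are assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn

class EmotionLSTM(nn.Module):
    """Sketch of a sequence regressor mapping frame-level features to arousal/valence."""
    def __init__(self, num_features, hidden_size=64):
        super().__init__()
        # Recurrence lets the network use past context, e.g. to absorb rater reaction lag.
        self.lstm = nn.LSTM(num_features, hidden_size, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden_size, 2)   # one output per dimension: arousal, valence

    def forward(self, x):                      # x: (batch, time, num_features)
        h, _ = self.lstm(x)                    # h: (batch, time, hidden_size)
        return self.out(h)                     # frame-wise predictions: (batch, time, 2)
```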

[1]  Carlos Busso,et al.  Correcting Time-Continuous Emotional Labels by Modeling the Reaction Lag of Evaluators , 2015, IEEE Transactions on Affective Computing.

[2]  Kostas Karpouzis,et al.  The HUMAINE Database: Addressing the Collection and Annotation of Naturalistic and Induced Emotional Data , 2007, ACII.

[3]  Jeffrey M. Hausdorff,et al.  Dynamic markers of altered gait rhythm in amyotrophic lateral sclerosis. , 2000, Journal of applied physiology.

[4]  L. Lin,et al.  A concordance correlation coefficient to evaluate reproducibility. , 1989, Biometrics.

[5]  Gunnar Farnebäck,et al.  Two-Frame Motion Estimation Based on Polynomial Expansion , 2003, SCIA.

[6]  Hatice Gunes,et al.  Automatic Segmentation of Spontaneous Data using Dimensional Labels from Multiple Coders , 2010 .

[7]  Björn W. Schuller,et al.  Building Autonomous Sensitive Artificial Listeners , 2012, IEEE Transactions on Affective Computing.

[8]  Shrikanth S. Narayanan,et al.  Support Vector Regression for Automatic Recognition of Spontaneous Emotions in Speech , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[9]  Jennifer Healey,et al.  Toward Machine Emotional Intelligence: Analysis of Affective Physiological State , 2001, IEEE Trans. Pattern Anal. Mach. Intell..

[10]  P. J. Werbos  Backpropagation through time: what it does and how to do it , 1990, Proceedings of the IEEE.

[11]  K. Scherer,et al.  On the Acoustics of Emotion in Audio: What Speech, Music, and Sound have in Common , 2013, Front. Psychol..

[12]  Athanasios Katsamanis,et al.  Tracking changes in continuous emotion states using body language and prosodic cues , 2011, 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[13]  Shrikanth S. Narayanan,et al.  The Vera am Mittag German audio-visual emotional speech database , 2008, 2008 IEEE International Conference on Multimedia and Expo.

[14]  Fernando De la Torre,et al.  Supervised Descent Method and Its Applications to Face Alignment , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[15]  A. Tversky  Intransitivity of preferences , 1969, Psychological Review.

[16]  R. T. Pivik,et al.  Handbook of Psychophysiology: Sleep and Dreaming , 2007 .

[17]  Thomas Martinetz,et al.  The Intrinsic Recurrent Support Vector Machine , 2007, ESANN.

[18]  Yi-Ping Hung,et al.  2D Face Alignment and Pose Estimation Based on 3D Facial Models , 2012, 2012 IEEE International Conference on Multimedia and Expo.

[19]  R. Levenson Emotion and the autonomic nervous system: A prospectus for research on autonomic specificity. , 1988 .

[20]  Hatice Gunes,et al.  Automatic, Dimensional and Continuous Emotion Recognition , 2010, Int. J. Synth. Emot..

[21]  Björn W. Schuller,et al.  Frame vs. Turn-Level: Emotion Recognition from Speech Considering Static and Dynamic Processing , 2007, ACII.

[22]  Björn Schuller,et al.  Opensmile: the munich versatile and fast open-source audio feature extractor , 2010, ACM Multimedia.

[23]  Alex Graves,et al.  Supervised Sequence Labelling with Recurrent Neural Networks , 2012, Studies in Computational Intelligence.

[24]  Hatice Gunes,et al.  Continuous Prediction of Spontaneous Affect from Multiple Cues and Modalities in Valence-Arousal Space , 2011, IEEE Transactions on Affective Computing.

[25]  Wen-Jing Yan,et al.  How Fast are the Leaked Facial Expressions: The Duration of Micro-Expressions , 2013 .

[26]  Fabien Ringeval,et al.  Time-Scale Feature Extractions for Emotional Speech Characterization , 2009, Cognitive Computation.

[27]  Willis J. Tompkins,et al.  A Real-Time QRS Detection Algorithm , 1985, IEEE Transactions on Biomedical Engineering.

[28]  Carlos Busso,et al.  Analysis and Compensation of the Reaction Lag of Evaluators in Continuous Emotional Annotations , 2013, 2013 Humaine Association Conference on Affective Computing and Intelligent Interaction.

[29]  Ronald D Berger,et al.  Heart Rate Variability , 2006, Journal of cardiovascular electrophysiology.

[30]  Kristian Kroschel,et al.  Audio-visual emotion recognition using an emotion space concept , 2008, 2008 16th European Signal Processing Conference.

[31]  Fabien Ringeval,et al.  Introducing the RECOLA multimodal corpus of remote collaborative and affective interactions , 2013, 2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG).

[32]  M. Dawson,et al.  The electrodermal system , 2007 .

[33]  Björn W. Schuller,et al.  Categorical and dimensional affect analysis in continuous input: Current trends and future directions , 2013, Image Vis. Comput..

[34]  Kuldip K. Paliwal,et al.  Bidirectional recurrent neural networks , 1997, IEEE Trans. Signal Process..

[35]  Florian Eyben,et al.  Real-time Speech and Music Classification by Large Audio Feature Space Extraction , 2015 .

[36]  Aleksandar Kalauzi,et al.  Extracting complexity waveforms from one-dimensional signals , 2009, Nonlinear biomedical physics.

[37]  Mohamed Chetouani,et al.  Robust continuous prediction of human emotions using multiscale dynamic cues , 2012, ICMI '12.

[38]  Björn W. Schuller,et al.  A multitask approach to continuous five-dimensional affect sensing in natural speech , 2012, TIIS.

[39]  Björn W. Schuller,et al.  On-line emotion recognition in a 3-D activation-valence-time continuum using acoustic and linguistic cues , 2009, Journal on Multimodal User Interfaces.

[40]  Fabio Valente,et al.  The INTERSPEECH 2013 computational paralinguistics challenge: social signals, conflict, emotion, autism , 2013, INTERSPEECH.

[41]  Björn W. Schuller,et al.  Recent developments in openSMILE, the munich open-source multimedia feature extractor , 2013, ACM Multimedia.

[42]  Fabien Ringeval,et al.  The INTERSPEECH 2014 computational paralinguistics challenge: cognitive & physical load , 2014, INTERSPEECH.

[43]  Jay Hall,et al.  The Effects of a Normative Intervention on Group Decision-Making Performance , 1970 .

[44]  Constantine Kotropoulos,et al.  Emotional speech recognition: Resources, features, and methods , 2006, Speech Commun..

[45]  Jonghwa Kim,et al.  Bimodal Emotion Recognition using Speech and Physiological Changes , 2007 .

[46]  David G. Lowe  Distinctive Image Features from Scale-Invariant Keypoints , 2004, International Journal of Computer Vision.

[47]  Jürgen Schmidhuber,et al.  Framewise phoneme classification with bidirectional LSTM and other neural network architectures , 2005, Neural Networks.

[48]  P. Ekman,et al.  Facial action coding system: a technique for the measurement of facial movement , 1978 .

[49]  Yoshua Bengio,et al.  Gradient Flow in Recurrent Nets: the Difficulty of Learning Long-Term Dependencies , 2001 .

[50]  Hatice Gunes,et al.  Audio-Visual Classification and Fusion of Spontaneous Affective Data in Likelihood Space , 2010, 2010 20th International Conference on Pattern Recognition.

[51]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[52]  Jing Xiao,et al.  Robust full-motion recovery of head by dynamic templates and re-registration techniques , 2003, International Journal of Imaging Systems and Technology.