Online Affect Tracking with Multimodal Kalman Filters

Arousal and valence have been widely used to represent emotions dimensionally and to measure them continuously in time. In this paper, we introduce a computational framework for tracking these affective dimensions from multimodal data, developed as an entry to the Multimodal Affect Recognition Sub-Challenge of the 2016 Audio/Visual Emotion Challenge and Workshop (AVEC 2016). We propose a linear dynamical system approach with late fusion that accounts for the temporal dynamics of the affective state (i.e., arousal or valence). To this end, single-modality predictions are modeled as observations in a Kalman filter formulation, allowing each affective dimension to be tracked continuously. Leveraging the inter-correlation between arousal and valence, we use the predicted arousal as an additional feature to improve valence prediction. Furthermore, we propose a conditional framework that selects among the modality-specific Kalman filters during tracking, using voicing probability and facial posture cues to detect the presence or absence of each input modality. Our multimodal fusion results on the development and test sets show a statistically significant improvement over the AVEC 2016 baseline system. The proposed approach can potentially be extended to other multimodal tasks with inter-correlated behavioral dimensions.
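To make the late-fusion formulation concrete, the sketch below treats each modality's regression output as a noisy observation of a scalar affective state and updates the filter only on the modalities detected as present. This is a minimal illustration under assumed specifics, not the authors' implementation: the random-walk transition model, the names `AffectKalmanFilter`, `q`, `r`, and the boolean `present` mask are all illustrative choices.

```python
# Minimal sketch of late-fusion affect tracking with a Kalman filter.
# Assumptions: 1-D affective state (arousal or valence) following a
# random walk; each modality's prediction observes that state directly.
import numpy as np

class AffectKalmanFilter:
    def __init__(self, n_modalities, q=1e-3, r=1e-1):
        self.A = 1.0                       # state transition (random walk)
        self.C = np.ones(n_modalities)     # each modality observes the state
        self.Q = q                         # process noise variance
        self.R = np.eye(n_modalities) * r  # observation noise covariance
        self.x = 0.0                       # state estimate
        self.P = 1.0                       # state estimate variance

    def step(self, z, present):
        """z: per-modality predictions at this frame;
        present: boolean mask of modalities detected as active
        (e.g., from voicing probability or face detection)."""
        # Predict.
        x_pred = self.A * self.x
        P_pred = self.A * self.P * self.A + self.Q
        if not np.any(present):
            # No modality available: propagate the prediction only.
            self.x, self.P = x_pred, P_pred
            return self.x
        # Update using only the observed modalities.
        C = self.C[present]
        R = self.R[np.ix_(present, present)]
        S = np.outer(C, C) * P_pred + R     # innovation covariance
        K = (P_pred * C) @ np.linalg.inv(S) # Kalman gain
        self.x = x_pred + K @ (z[present] - C * x_pred)
        self.P = (1.0 - K @ C) * P_pred
        return self.x

# Usage: fuse audio and video arousal predictions frame by frame.
kf = AffectKalmanFilter(n_modalities=2)
estimate = kf.step(np.array([0.3, 0.5]), present=np.array([True, True]))
```

In practice the transition and noise parameters would be estimated from data, for instance with EM for linear dynamical systems, rather than fixed by hand as above.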
