An Investigation of Emotion Dynamics and Kalman Filtering for Speech-Based Emotion Prediction

Despite recent interest in the continuous prediction of dimensional emotions, the dynamical aspect of emotion has received less attention in automated systems. This paper investigates how emotion change can be effectively incorporated to improve the continuous prediction of arousal and valence from speech. In investigations on the RECOLA database, significant correlations were found between emotion ratings and their dynamics, and here we examine how best to exploit them using a Kalman filter. In particular, we investigate the correlation of predicted arousal and valence dynamics with the arousal and valence ground truth; the internal delay used by the Kalman filter when estimating the state transition matrix; the use of emotion dynamics as a measurement input to the Kalman filter; and how multiple probabilistic Kalman filter outputs can be effectively fused. Evaluation results show that correct dynamics estimation and internal delay settings allow relative improvements of up to 5% in arousal and 58% in valence prediction over existing Kalman filter implementations. Fusion based on probabilistic Kalman filter outputs yields further gains.
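
To make the filtering setup concrete, below is a minimal sketch of a per-attribute Kalman filter that takes the base emotion prediction and the predicted dynamics (frame-to-frame delta) together as the measurement vector, with the state transition matrix estimated from gold-standard training ratings under an internal delay. This is an illustrative reading of the abstract, not the authors' exact formulation: the two-component state layout [value, delta], the least-squares transition estimate, the use of the same lag `d` for differencing and for the transition step, and the helper names `estimate_transition` and `kalman_track` are all assumptions; the noise covariances `Q` and `R` would in practice be estimated from residuals on training data.

```python
import numpy as np

def estimate_transition(gold, d=1):
    """Least-squares estimate of the state transition matrix A from
    gold-standard ratings, using an internal delay of d frames.
    State at time t (assumed form): z_t = [rating_t, rating_t - rating_{t-d}].
    """
    z = np.stack([gold[d:], gold[d:] - gold[:-d]], axis=1)  # (T-d, 2)
    Z_prev, Z_next = z[:-d], z[d:]
    # Solve Z_next ~= Z_prev @ X in the least-squares sense, so that
    # z_next ~= X.T @ z_prev; return A = X.T.
    X, *_ = np.linalg.lstsq(Z_prev, Z_next, rcond=None)
    return X.T

def kalman_track(preds, pred_deltas, A, Q, R):
    """Track one emotion attribute (arousal or valence) with a 2-D
    Kalman filter whose measurement stacks the base prediction and the
    predicted dynamics. Returns the filtered trajectory and per-frame
    posterior variances (reused below for probabilistic fusion)."""
    H = np.eye(2)            # both state components are observed
    z = np.zeros(2)          # state estimate [value, delta]
    P = np.eye(2)            # state covariance
    out_mean, out_var = [], []
    for y in np.stack([preds, pred_deltas], axis=1):
        # Predict step.
        z = A @ z
        P = A @ P @ A.T + Q
        # Update step with measurement y = [prediction, predicted delta].
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)
        z = z + K @ (y - H @ z)
        P = (np.eye(2) - K @ H) @ P
        out_mean.append(z[0])
        out_var.append(P[0, 0])
    return np.array(out_mean), np.array(out_var)
```

In this sketch, the internal delay `d` controls how far apart the state pairs used to fit A lie; the abstract's finding is that tuning this delay (rather than fixing it to one frame) is what drives much of the reported improvement.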

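The fusion of multiple probabilistic Kalman filter outputs can be sketched similarly. A precision-weighted (inverse-variance) combination is one natural reading of "fusion based on probabilistic Kalman filter outputs"; it assumes each filter exposes a per-frame posterior variance, as `kalman_track` above does, and is offered as an assumption rather than the paper's exact fusion rule.

```python
import numpy as np

def fuse(means, variances, eps=1e-8):
    """Precision-weighted fusion of several probabilistic filter outputs.

    means, variances: arrays of shape (n_filters, T) holding each
    filter's per-frame posterior mean and variance for one attribute.
    Frames where a filter is more certain (lower variance) receive
    proportionally more weight; eps guards against division by zero.
    """
    w = 1.0 / (np.asarray(variances, dtype=float) + eps)
    m = np.asarray(means, dtype=float)
    return (w * m).sum(axis=0) / w.sum(axis=0)
```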