Dynamic speech emotion recognition with state-space models

Automatic emotion recognition from speech has been focused mainly on identifying categorical or static affect states, but the spectrum of human emotion is continuous and time-varying. In this paper, we present a recognition system for dynamic speech emotion based on state-space models (SSMs). The prediction of the unknown emotion trajectory in the affect space spanned by Arousal, Valence, and Dominance (A-V-D) descriptors is cast as a time series filtering task. The state space models we investigated include a standard linear model (Kalman filter) as well as novel non-linear, non-parametric Gaussian Processes (GP) based SSM. We use the AVEC 2014 database for evaluation, which provides ground truth A-V-D labels which allows state and measurement functions to be learned separately simplifying the model training. For the filtering with GP SSM, we used two approximation methods: a recently proposed analytic method and Particle filter. All models were evaluated in terms of average Pearson correlation R and root mean square error (RMSE). The results show that using the same feature vectors, the GP SSMs achieve twice higher correlation and twice smaller RMSE than a Kalman filter.

[1]  Carl E. Rasmussen,et al.  State-Space Inference and Learning with Gaussian Processes , 2010, AISTATS.

[2]  Nando de Freitas,et al.  Sequential Monte Carlo Methods in Practice , 2001, Statistics for Engineering and Information Science.

[3]  Carl E. Rasmussen,et al.  Bayesian Inference and Learning in Gaussian Process State-Space Models with Particle MCMC , 2013, NIPS.

[4]  Ieee Staff 2017 25th European Signal Processing Conference (EUSIPCO) , 2017 .

[5]  Carl E. Rasmussen,et al.  Robust Filtering and Smoothing with Gaussian Processes , 2012, IEEE Transactions on Automatic Control.

[6]  Dieter Fox,et al.  GP-BayesFilters: Bayesian filtering using Gaussian process prediction and observation models , 2008, IROS.

[7]  Heng Wang,et al.  Depression recognition based on dynamic facial and vocal expression features using partial least square regression , 2013, AVEC@ACM Multimedia.

[8]  Björn W. Schuller,et al.  Abandoning emotion classes - towards continuous emotion recognition with modelling of long-range dependencies , 2008, INTERSPEECH.

[9]  Björn Schuller,et al.  Opensmile: the munich versatile and fast open-source audio feature extractor , 2010, ACM Multimedia.

[10]  S. Haykin Kalman Filtering and Neural Networks , 2001 .

[11]  Youngmoo E. Kim,et al.  Prediction of Time-Varying Musical Mood Distributions Using Kalman Filtering , 2010, 2010 Ninth International Conference on Machine Learning and Applications.

[12]  Christopher K. I. Williams,et al.  Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning) , 2005 .

[13]  Jongmin Kim,et al.  Phoneme Classification using Constrained Variational Gaussian Process Dynamical System , 2012, NIPS.

[14]  Carl E. Rasmussen,et al.  Gaussian processes for machine learning , 2005, Adaptive computation and machine learning.

[15]  David J. Fleet,et al.  This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE Gaussian Process Dynamical Model , 2007 .

[16]  Gustav Eje Henter,et al.  Gaussian process dynamical models for nonparametric speech representation and synthesis , 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[17]  Tomoko Matsui,et al.  Music Genre and Emotion Recognition Using Gaussian Processes , 2014, IEEE Access.

[18]  M. Deisenroth,et al.  A general perspective on Gaussian filtering and smoothing: Explaining current and deriving new algorithms , 2011, Proceedings of the 2011 American Control Conference.

[19]  Uwe D. Hanebeck,et al.  Analytic moment-based Gaussian process filtering , 2009, ICML '09.

[20]  Carl E. Rasmussen,et al.  Gaussian Processes for Machine Learning (GPML) Toolbox , 2010, J. Mach. Learn. Res..

[21]  Mohamed Chetouani,et al.  Robust continuous prediction of human emotions using multiscale dynamic cues , 2012, ICMI '12.

[22]  Björn W. Schuller,et al.  AVEC 2014: 3D Dimensional Affect and Depression Recognition Challenge , 2014, AVEC '14.

[23]  Nadia Bianchi-Berthouze,et al.  Naturalistic Affective Expression Classification by a Multi-stage Approach Based on Hidden Markov Models , 2011, ACII.