Multimodal Prediction of Affective Dimensions via Fusing Multiple Regression Techniques

This paper presents a multimodal approach to predicting affective dimensions that makes full use of audio, video, electrodermal activity (EDA), and electrocardiogram (ECG) features with three regression techniques: support vector regression (SVR), partial least squares (PLS) regression, and deep bidirectional long short-term memory recurrent neural network (DBLSTM-RNN) regression. Each technique first predicts the affective dimensions separately from each of the four modalities, and an SVR then fuses the resulting per-modality models; a further SVR performs the final fusion of the three regression systems. Experiments show that the proposed approach obtains promising results on the AVEC 2015 benchmark dataset for multimodal affective dimension prediction. On the development set, it achieves a concordance correlation coefficient (CCC) of 0.856 for arousal and 0.720 for valence, improving on the AVEC 2015 top performer by 3.88% and 4.66%, respectively.
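
As a rough illustration of the two-stage fusion described above, the following sketch uses scikit-learn. Everything here is an assumption for demonstration: the data, feature dimensions, and hyperparameters are invented, PLSRegression stands in for the paper's PLS model, and a plain SVR substitutes for the DBLSTM-RNN (which scikit-learn does not provide). In the paper the base models and fusion SVRs would be trained and evaluated on separate partitions, unlike this toy example, which fits and predicts on the same frames.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
n_frames = 200

# Hypothetical per-modality feature matrices (audio, video, EDA, ECG);
# the dimensions are placeholders, not the paper's actual feature sets.
modalities = {"audio": rng.normal(size=(n_frames, 20)),
              "video": rng.normal(size=(n_frames, 30)),
              "eda":   rng.normal(size=(n_frames, 5)),
              "ecg":   rng.normal(size=(n_frames, 8))}
y = rng.normal(size=n_frames)  # gold-standard ratings (e.g., arousal)

def ccc(a, b):
    """Concordance correlation coefficient, the AVEC 2015 metric:
    2*cov(a,b) / (var(a) + var(b) + (mean(a) - mean(b))^2)."""
    cov = np.cov(a, b, bias=True)[0, 1]
    return 2 * cov / (a.var() + b.var() + (a.mean() - b.mean()) ** 2)

def modality_fusion(make_regressor):
    """Stage 1: one base regressor per modality, then an SVR fuses
    the four per-modality prediction streams."""
    preds = [np.asarray(make_regressor().fit(X, y).predict(X)).ravel()
             for X in modalities.values()]
    stacked = np.column_stack(preds)          # shape: (n_frames, 4)
    fusion = SVR(kernel="linear").fit(stacked, y)
    return fusion.predict(stacked)

# One fused prediction stream per regression technique; a linear SVR
# stands in for the DBLSTM-RNN here.
system_preds = [modality_fusion(lambda: SVR(kernel="rbf")),
                modality_fusion(lambda: PLSRegression(n_components=2)),
                modality_fusion(lambda: SVR(kernel="linear"))]

# Stage 2: a final SVR fuses the three systems' outputs.
final_in = np.column_stack(system_preds)
final = SVR(kernel="linear").fit(final_in, y)
print("CCC:", ccc(final.predict(final_in), y))
```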
