Jointly Predicting Arousal, Valence and Dominance with Multi-Task Learning

An appealing representation of emotions is the use of emotional attributes such as arousal (passive versus active), valence (negative versus positive) and dominance (weak versus strong). While previous studies have considered these dimensions as orthogonal descriptors to represent emotions, there is strong theoretical and practical evidence of the interrelation between these emotional attributes. This observation suggests that predicting emotional attributes with a unified framework should outperform machine learning algorithms that predict each attribute separately. This study presents methods to jointly learn emotional attributes by exploiting their interdependencies. The framework relies on multi-task learning (MTL) implemented with deep neural networks (DNNs) with shared hidden layers. The framework provides a principled approach to learning shared feature representations that maximize the performance of the regression models. The results of within-corpus and cross-corpora evaluations show the benefits of MTL over single-task learning (STL). MTL achieves gains in concordance correlation coefficient (CCC) as high as 4.7% for within-corpus evaluations and 14.0% for cross-corpora evaluations. The visualization of the activations of the last hidden layers illustrates that MTL creates better feature representations. The best structure has shared layers followed by attribute-dependent layers, which better captures the relation between the attributes.
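For concreteness, the joint prediction described above can be viewed as a shared-trunk regressor with attribute-dependent output layers trained with a CCC-based objective. The following is a minimal sketch, not the authors' implementation: it assumes PyTorch, and the layer sizes, attribute names, loss weights, and the 6,373-dimensional input (e.g., an openSMILE-style acoustic feature vector) are illustrative assumptions.

# Minimal sketch (assumed PyTorch): shared hidden layers followed by
# attribute-dependent layers for arousal, valence and dominance,
# trained with a 1 - CCC loss per attribute. All sizes are illustrative.
import torch
import torch.nn as nn

class SharedMTLRegressor(nn.Module):
    def __init__(self, input_dim=6373, shared_dim=256, task_dim=128):
        super().__init__()
        # Shared layers learn a common feature representation for all attributes.
        self.shared = nn.Sequential(
            nn.Linear(input_dim, shared_dim), nn.ReLU(),
            nn.Linear(shared_dim, shared_dim), nn.ReLU(),
        )
        # Attribute-dependent layers specialize the shared representation.
        self.heads = nn.ModuleDict({
            attr: nn.Sequential(nn.Linear(shared_dim, task_dim), nn.ReLU(),
                                nn.Linear(task_dim, 1))
            for attr in ("arousal", "valence", "dominance")
        })

    def forward(self, x):
        h = self.shared(x)
        return {attr: head(h).squeeze(-1) for attr, head in self.heads.items()}

def ccc_loss(pred, gold, eps=1e-8):
    # 1 - concordance correlation coefficient (CCC), used as a regression loss.
    p_mean, g_mean = pred.mean(), gold.mean()
    p_var, g_var = pred.var(unbiased=False), gold.var(unbiased=False)
    cov = ((pred - p_mean) * (gold - g_mean)).mean()
    ccc = 2.0 * cov / (p_var + g_var + (p_mean - g_mean) ** 2 + eps)
    return 1.0 - ccc

# Joint objective: a weighted sum of the per-attribute losses
# (equal weights here are an assumption, not the authors' setting).
# outputs = model(features)
# total_loss = sum(ccc_loss(outputs[a], targets[a]) for a in outputs)

In a single-task (STL) baseline, the same trunk and one head would be trained for each attribute in isolation; the shared trunk is what lets the MTL model exploit the interdependencies between arousal, valence and dominance.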
