Continuous Emotion Recognition in Speech - Do We Need Recurrence?

Emotion recognition in speech is an important task in affective computing and human-computer interaction. As human emotion is a continuously evolving state, it is usually represented as a densely sampled time series of emotional dimensions, typically arousal and valence. Recurrent neural network (RNN) architectures are therefore the default choice for modelling these contours with deep learning. However, the amount of temporal context actually required is an open question, and it has not yet been clarified whether modelling long-term dependencies is beneficial at all. In this contribution, we demonstrate that RNNs are not necessary to accomplish the task of time-continuous emotion recognition. Indeed, our results indicate that deep neural networks built from less complex convolutional layers can yield more accurate models. We highlight the pros and cons of recurrent and non-recurrent approaches and evaluate our methods on the public SEWA database, which served as the benchmark in the 2017 and 2018 editions of the Audio/Visual Emotion Challenge (AVEC).
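To make the architectural contrast concrete, the sketch below shows a frame-wise LSTM regressor next to a purely convolutional one for predicting arousal and valence contours. This is not the authors' implementation: the layer sizes, kernel width, and the 88-dimensional frame features (as in the eGeMAPS acoustic parameter set) are illustrative assumptions. The recurrent model can, in principle, draw on the entire preceding sequence, whereas the convolutional model sees only a fixed receptive field.

```python
# Minimal sketch (assumed hyperparameters, not the paper's exact models):
# an LSTM regressor vs. a stacked 1-D convolutional regressor, both mapping
# a sequence of acoustic feature frames to frame-wise arousal/valence.
import torch
import torch.nn as nn

FEAT_DIM = 88   # assumed per-frame feature dimensionality (e.g. eGeMAPS)
OUT_DIM = 2     # arousal and valence

class RNNRegressor(nn.Module):
    """LSTM over the whole sequence: unbounded temporal context."""
    def __init__(self, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(FEAT_DIM, hidden, batch_first=True)
        self.head = nn.Linear(hidden, OUT_DIM)

    def forward(self, x):            # x: (batch, time, FEAT_DIM)
        h, _ = self.lstm(x)
        return self.head(h)          # (batch, time, OUT_DIM)

class ConvRegressor(nn.Module):
    """Stacked 1-D convolutions: context limited to the receptive field."""
    def __init__(self, channels=64, kernel=9):
        super().__init__()
        pad = kernel // 2            # 'same' padding keeps the time axis intact
        self.net = nn.Sequential(
            nn.Conv1d(FEAT_DIM, channels, kernel, padding=pad), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel, padding=pad), nn.ReLU(),
            nn.Conv1d(channels, OUT_DIM, 1),
        )

    def forward(self, x):            # x: (batch, time, FEAT_DIM)
        y = self.net(x.transpose(1, 2))   # Conv1d expects (batch, channels, time)
        return y.transpose(1, 2)          # (batch, time, OUT_DIM)

x = torch.randn(4, 500, FEAT_DIM)    # 4 utterances, 500 frames each
print(RNNRegressor()(x).shape, ConvRegressor()(x).shape)
```

The design difference is visible in the forward passes: the LSTM carries hidden state across the whole utterance, while the convolutional stack's temporal context is bounded by its stacked kernels (two kernel-9 layers give a 17-frame receptive field here), which is exactly the kind of limited context the paper argues can suffice.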
