Multi-modal Fusion for Continuous Emotion Recognition by Using Auto-Encoders

Human stress detection is of great importance for monitoring mental health. The Multimodal Sentiment Analysis Challenge (MuSe) 2021 focuses on emotion, physiological-emotion, and stress recognition as well as sentiment classification by exploiting several modalities. In this paper, we present our solution for the MuSe-Stress sub-challenge, whose target is the continuous prediction of arousal and valence for people under stressful conditions, with text transcripts, audio, and video recordings provided. To this end, we utilize bidirectional Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks to exploit high-level and low-level features from the different modalities. We employ the Concordance Correlation Coefficient (CCC) as both the loss function and the evaluation metric for our models. To improve the unimodal predictions, we add difficulty indicators of the data obtained by using Auto-Encoders. Finally, we perform late fusion of the unimodal predictions together with the difficulty indicators to obtain our final predictions. With this approach, we achieve CCC scores of 0.4278 for arousal and 0.5951 for valence on the test set; our submission to MuSe 2021 ranks in the top three for arousal, fourth for valence, and in the top three for the combined result.
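For concreteness, the sketch below shows one way a CCC-based training loss can be implemented. The PyTorch framework, the function name, and the assumption of 1-D prediction/gold tensors are our own illustrative choices; the abstract does not specify the paper's actual implementation.

```python
import torch

def ccc_loss(pred: torch.Tensor, gold: torch.Tensor) -> torch.Tensor:
    """Loss = 1 - CCC, where
    CCC = 2*cov(pred, gold) / (var(pred) + var(gold) + (mean(pred) - mean(gold))**2).
    Inputs are 1-D tensors holding one continuous annotation sequence each."""
    pred_mean, gold_mean = pred.mean(), gold.mean()
    pred_var = pred.var(unbiased=False)
    gold_var = gold.var(unbiased=False)
    covariance = ((pred - pred_mean) * (gold - gold_mean)).mean()
    ccc = 2.0 * covariance / (pred_var + gold_var + (pred_mean - gold_mean) ** 2)
    # CCC lies in [-1, 1]; minimizing 1 - CCC pushes predictions toward the gold trace
    return 1.0 - ccc
```

Because CCC penalizes both scale and location shifts in addition to decorrelation, using it directly as the loss aligns training with the challenge's evaluation metric, rather than optimizing a proxy such as mean squared error.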
