Multi-Modality Emotion Recognition Model with GAT-Based Multi-Head Inter-Modality Attention

Emotion recognition has been gaining attention in recent years due to its applications in artificial agents. To achieve good performance on this task, much research has focused on multi-modality emotion recognition models that leverage the complementary strengths of each modality. However, a research question remains: what is the most appropriate way to fuse the information from the different modalities? In this paper, we propose audio sample augmentation and an emotion-oriented encoder-decoder to improve emotion recognition performance, and we discuss an inter-modality, decision-level fusion method based on a graph attention network (GAT). Compared to the baseline, our model improves the weighted average F1-score from 64.18% to 68.31% and the weighted average accuracy from 65.25% to 69.88%.
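The abstract does not spell out the fusion layer, but the idea can be illustrated with a minimal sketch: each modality's utterance-level embedding (text, audio, visual) is treated as a node in a small fully connected graph, and GAT-style multi-head attention aggregates across the modality nodes to produce the fused representation used for decision-level classification. All class names, dimensions, and the three-node graph below are illustrative assumptions, not the authors' implementation.

    # Minimal sketch (assumed, not the authors' code) of GAT-style
    # multi-head inter-modality attention for decision-level fusion.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class InterModalityGAT(nn.Module):
        def __init__(self, dim=256, heads=4, num_classes=4):
            super().__init__()
            assert dim % heads == 0
            self.heads, self.d_head = heads, dim // heads
            self.proj = nn.Linear(dim, dim, bias=False)  # shared node projection W
            # per-head GAT attention vector a, split into source/destination halves
            self.attn = nn.Parameter(torch.empty(heads, 2 * self.d_head))
            nn.init.xavier_uniform_(self.attn)
            self.classifier = nn.Linear(dim, num_classes)

        def forward(self, text, audio, visual):
            # Stack per-modality utterance embeddings as graph nodes: (batch, 3, dim)
            h = torch.stack([text, audio, visual], dim=1)
            b, n, _ = h.shape
            h = self.proj(h).view(b, n, self.heads, self.d_head)  # (b, n, H, d)

            # GAT scoring e_ij = LeakyReLU(a^T [W h_i || W h_j]) on the
            # fully connected modality graph, computed per head
            src = torch.einsum('bnhd,hd->bnh', h, self.attn[:, :self.d_head])
            dst = torch.einsum('bnhd,hd->bnh', h, self.attn[:, self.d_head:])
            e = F.leaky_relu(src.unsqueeze(2) + dst.unsqueeze(1), 0.2)  # (b, n, n, H)
            alpha = torch.softmax(e, dim=2)  # attention over neighbour nodes j

            # Aggregate neighbour features per head, then concatenate heads
            out = torch.einsum('bijh,bjhd->bihd', alpha, h).reshape(b, n, -1)
            fused = out.mean(dim=1)  # pool the three modality nodes
            return self.classifier(fused)  # emotion logits for decision-level fusion

    # Example usage: a batch of 8 utterances with 256-d features per modality
    model = InterModalityGAT()
    logits = model(torch.randn(8, 256), torch.randn(8, 256), torch.randn(8, 256))

Multi-head attention here serves the same purpose as in the paper's title: each head can weight the modality graph differently (e.g., trusting audio more for arousal-like cues and text more for content), before the heads are merged for the final decision.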
