Transformer-based Label Set Generation for Multi-modal Multi-label Emotion Detection

Multi-modal utterance-level emotion detection has become an active research topic in both the multi-modal analysis and natural language processing communities. Unlike traditional single-label multi-modal sentiment analysis, multi-modal emotion detection is naturally a multi-label problem, since an utterance often conveys multiple emotions. Existing studies normally focus on multi-modal fusion alone and transform multi-label emotion classification into multiple independent binary classification problems. As a result, they largely ignore two kinds of important dependency information: (1) modality-to-label dependency, where different emotions are inferred from different modalities, i.e., each modality contributes differently to each potential emotion; and (2) label-to-label dependency, where some emotions are more likely to co-occur than conflicting ones. To model both kinds of dependency simultaneously, we propose a unified approach, the multi-modal emotion set generation network (MESGN), which generates an emotion set for each utterance. Specifically, we first employ a cross-modal transformer encoder to capture cross-modal interactions among different modalities, and a standard transformer encoder to capture temporal information within each modality-specific sequence given these interactions. We then design a transformer-based discriminative decoding module equipped with modality-to-label attention to handle the modality-to-label dependency, and employ a reinforced decoding algorithm with self-critical learning to handle the label-to-label dependency. Finally, we validate the proposed MESGN architecture on both the word-level aligned and unaligned settings of a multi-modal dataset. Detailed experimentation shows that MESGN effectively improves the performance of multi-modal multi-label emotion detection.
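
To make the decoding scheme described above concrete, the following is a minimal, hypothetical PyTorch sketch of an MESGN-style pipeline: a cross-modal transformer encoder in which one modality attends to another, followed by a transformer decoder that emits emotion labels one at a time until an end-of-set token, so the output is a variable-sized label set rather than independent binary decisions. All module names, dimensions, and the greedy decoding loop are illustrative assumptions rather than the authors' implementation; in particular, the self-critical reinforcement step used for label-to-label dependency is omitted here.

# Hypothetical sketch of an MESGN-style encoder/decoder; not the reference implementation.
import torch
import torch.nn as nn

class CrossModalEncoder(nn.Module):
    """Lets a target modality (e.g., text) attend to a source modality (e.g., audio)."""
    def __init__(self, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.temporal = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), num_layers=1)

    def forward(self, target_seq, source_seq):
        # Cross-modal attention: queries from the target modality, keys/values from the source.
        fused, _ = self.cross_attn(target_seq, source_seq, source_seq)
        # Standard transformer encoder captures temporal structure of the fused sequence.
        return self.temporal(fused)

class EmotionSetDecoder(nn.Module):
    """Greedy transformer decoder that emits one emotion label per step until <eos>."""
    def __init__(self, n_labels: int = 6, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        self.label_emb = nn.Embedding(n_labels + 2, d_model)  # labels plus <bos>/<eos>
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True), num_layers=1)
        self.out = nn.Linear(d_model, n_labels + 2)
        self.bos, self.eos = n_labels, n_labels + 1

    @torch.no_grad()
    def generate(self, memory, max_len: int = 6):
        tokens = [self.bos]
        for _ in range(max_len):
            tgt = self.label_emb(torch.tensor([tokens]))
            h = self.decoder(tgt, memory)            # cross-attention over encoder memory
            next_id = self.out(h[:, -1]).argmax(-1).item()
            if next_id == self.eos or next_id in tokens:
                break                                # stop on <eos> or a repeated label
            tokens.append(next_id)
        return tokens[1:]                            # the predicted emotion label set

if __name__ == "__main__":
    text, audio = torch.randn(1, 20, 64), torch.randn(1, 50, 64)
    memory = CrossModalEncoder()(text, audio)        # text attends to audio
    print(EmotionSetDecoder().generate(memory))      # e.g., [2, 5] -> two predicted emotions

In this sketch, the decoder's cross-attention over the encoder memory plays the role of modality-to-label attention, and stopping at the end-of-set token (or on a repeated label) lets the number of predicted emotions vary per utterance.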
