A Hierarchical Transformer with Speaker Modeling for Emotion Recognition in Conversation

Emotion Recognition in Conversation (ERC) is a more challenging task than conventional text emotion recognition. It can be regarded as a personalized and interactive emotion recognition task, which must consider not only the semantic information of the text but also the influences of speakers. Existing methods model speakers' interactions by building a relation between every pair of speakers. However, this fine-grained but complicated modeling is computationally expensive, hard to extend, and can only consider local context. To address this problem, we simplify the complicated modeling to a binary version, Intra-Speaker and Inter-Speaker dependencies, without identifying every unique speaker relative to the targeted speaker. To realize this simplified interaction modeling in the Transformer, which excels at capturing long-distance dependencies, we design three types of masks and utilize them in three independent Transformer blocks. The masks respectively implement conventional context modeling, Intra-Speaker dependency, and Inter-Speaker dependency. Furthermore, since the speaker-aware information extracted by the three Transformer blocks contributes differently to the prediction, we utilize an attention mechanism to weight their outputs automatically. Experiments on two ERC datasets indicate that our model is effective and achieves better performance.

Introduction

Nowadays, intelligent machines that precisely capture speakers' emotions in conversations are gaining popularity, driving the development of Emotion Recognition in Conversation (ERC). ERC is the task of predicting the emotion of the current utterance expressed by a specific speaker according to the context (Poria et al. 2019b); it is more challenging than conventional emotion recognition, which only considers the semantic information of an independent utterance.
To precisely predict the emotion of a targeted utterance, both the semantic information of the utterance itself and the information provided by utterances in its context are critical. A number of works (Hazarika et al. 2018a,b; Majumder et al. 2019; Ghosal et al. 2019) demonstrate that modeling the interactions between speakers can facilitate extracting information from contextual utterances. We denote the information obtained by modeling speakers' interactions as speaker-aware information.

Copyright © 2021, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
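The three speaker-aware masks described above can be sketched as follows. This is a minimal illustration, not the paper's reference implementation: the function name is hypothetical, and keeping the diagonal in the Inter-Speaker mask (so every utterance can attend to itself) is our assumption. Given one speaker label per utterance, each mask restricts which positions an utterance may attend to in its Transformer block.

```python
def build_speaker_masks(speakers):
    """Build boolean attention masks from per-utterance speaker labels.

    Returns (conv, intra, inter), where True means "position i may
    attend to position j". Names and self-attention handling are
    illustrative assumptions, not the paper's exact formulation.
    """
    n = len(speakers)
    # Conventional context modeling: every utterance attends to all others.
    conv = [[True] * n for _ in range(n)]
    # Intra-Speaker dependency: attend only to the same speaker's utterances.
    intra = [[speakers[i] == speakers[j] for j in range(n)] for i in range(n)]
    # Inter-Speaker dependency: attend to other speakers' utterances; the
    # diagonal is kept so each position can still attend to itself (assumed).
    inter = [[speakers[i] != speakers[j] or i == j for j in range(n)]
             for i in range(n)]
    return conv, intra, inter

# Example: a dyadic conversation with utterance order A, B, A.
conv, intra, inter = build_speaker_masks(["A", "B", "A"])
```

In practice such boolean masks would be converted to additive masks (0 for allowed, a large negative value for disallowed positions) before being applied inside scaled dot-product attention.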

[1] Rada Mihalcea et al. MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations. ACL, 2018.

[2] Ramesh Nallapati et al. Who did They Respond to? Conversation Structure Modeling using Masked Hierarchical Transformer. AAAI, 2019.

[3] Ming-Wei Chang et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL, 2019.

[4] Xiaodong Liu et al. Unified Language Model Pre-training for Natural Language Understanding and Generation. NeurIPS, 2019.

[5] Eduard Hovy et al. Emotion Recognition in Conversation: Research Challenges, Datasets, and Recent Advances. IEEE Access, 2019.

[6] Frank Hutter et al. Decoupled Weight Decay Regularization. ICLR, 2017.

[7] Jianhua Tao et al. Conversational Emotion Analysis via Attention Mechanisms. INTERSPEECH, 2019.

[8] Zhe Wang et al. Hierarchical Transformer Network for Utterance-level Emotion Recognition. Applied Sciences, 2020.

[9] Carlos Busso et al. IEMOCAP: Interactive Emotional Dyadic Motion Capture Database. Language Resources and Evaluation, 2008.

[10] Erik Cambria et al. Context-Dependent Sentiment Analysis in User-Generated Videos. ACL, 2017.

[11] Erik Cambria et al. Conversational Memory Network for Emotion Recognition in Dyadic Dialogue Videos. NAACL, 2018.

[12] Jimmy Ba et al. Adam: A Method for Stochastic Optimization. ICLR, 2014.

[13] Michael R. Lyu et al. Real-Time Emotion Recognition via Attention Gated Hierarchical Memory Network. AAAI, 2019.

[14] Max Welling et al. Modeling Relational Data with Graph Convolutional Networks. ESWC, 2017.

[15] Michael R. Lyu et al. HiGRU: Hierarchical Gated Recurrent Units for Utterance-Level Emotion Recognition. NAACL, 2019.

[16] Chunyan Miao et al. Knowledge-Enriched Transformer for Emotion Detection in Textual Conversations. EMNLP, 2019.

[17] Jason Weston et al. End-To-End Memory Networks. NIPS, 2015.

[18] Alexander Gelbukh et al. DialogueGCN: A Graph Convolutional Neural Network for Emotion Recognition in Conversation. EMNLP, 2019.

[19] Alec Radford et al. Improving Language Understanding by Generative Pre-Training. 2018.

[20] Lukasz Kaiser et al. Attention is All you Need. NIPS, 2017.

[21] Jürgen Schmidhuber et al. Long Short-Term Memory. Neural Computation, 1997.

[22] Guodong Zhou et al. Modeling both Context- and Speaker-Sensitive Dependence for Emotion Detection in Multi-speaker Conversations. IJCAI, 2019.

[23] Yunming Ye et al. Enhancing Cross-target Stance Detection with Transferable Semantic-Emotion Knowledge. ACL, 2020.

[24] Rada Mihalcea et al. ICON: Interactive Conversational Memory Network for Multimodal Emotion Detection. EMNLP, 2018.

[25] Rada Mihalcea et al. DialogueRNN: An Attentive RNN for Emotion Detection in Conversations. AAAI, 2018.