A Hierarchical Transformer with Speaker Modeling for Emotion Recognition in Conversation

Emotion Recognition in Conversation (ERC) is a more challenging task than conventional text emotion recognition. It can be regarded as a personalized and interactive emotion recognition task, which must consider not only the semantic information of the text but also the influences of speakers. Existing methods model speakers' interactions by building a relation between every pair of speakers. However, this fine-grained but complicated modeling is computationally expensive, hard to extend, and can only consider local context. To address this problem, we simplify the complicated modeling to a binary version, Intra-Speaker and Inter-Speaker dependencies, without identifying every unique speaker relative to the targeted speaker. To realize this simplified interaction modeling in the Transformer, which excels at capturing long-distance dependencies, we design three types of masks and utilize them in three independent Transformer blocks. The masks respectively implement conventional context modeling, Intra-Speaker dependency, and Inter-Speaker dependency. Furthermore, since the speaker-aware information extracted by the three Transformer blocks contributes differently to the prediction, we utilize an attention mechanism to weight their outputs automatically. Experiments on two ERC datasets indicate that our model is effective and achieves better performance.

Introduction

Nowadays, intelligent machines that precisely capture speakers' emotions in conversations are gaining popularity, driving the development of Emotion Recognition in Conversation (ERC). ERC is the task of predicting the emotion of the current utterance expressed by a specific speaker according to the context (Poria et al. 2019b); it is more challenging than conventional emotion recognition, which only considers the semantic information of an independent utterance.
To precisely predict the emotion of a targeted utterance, both the semantic information of the utterance itself and the information provided by utterances in its context are critical. A number of works (Hazarika et al. 2018a,b; Majumder et al. 2019; Ghosal et al. 2019) demonstrate that modeling the interactions between speakers can facilitate extracting information from contextual utterances. We denote the information obtained by modeling speakers' interactions as speaker-aware information.

Copyright © 2021, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
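The three speaker-aware masks described above can be sketched as follows. This is a minimal illustration, not the paper's reference implementation: the function name is hypothetical, and keeping the diagonal in the Inter-Speaker mask (so every utterance can attend to itself) is our assumption. Given one speaker label per utterance, each mask restricts which positions an utterance may attend to in its Transformer block.

```python
def build_speaker_masks(speakers):
    """Build boolean attention masks from per-utterance speaker labels.

    Returns (conv, intra, inter), where True means "position i may
    attend to position j". Names and self-attention handling are
    illustrative assumptions, not the paper's exact formulation.
    """
    n = len(speakers)
    # Conventional context modeling: every utterance attends to all others.
    conv = [[True] * n for _ in range(n)]
    # Intra-Speaker dependency: attend only to the same speaker's utterances.
    intra = [[speakers[i] == speakers[j] for j in range(n)] for i in range(n)]
    # Inter-Speaker dependency: attend to other speakers' utterances; the
    # diagonal is kept so each position can still attend to itself (assumed).
    inter = [[speakers[i] != speakers[j] or i == j for j in range(n)]
             for i in range(n)]
    return conv, intra, inter

# Example: a dyadic conversation with utterance order A, B, A.
conv, intra, inter = build_speaker_masks(["A", "B", "A"])
```

In practice such boolean masks would be converted to additive masks (0 for allowed, a large negative value for disallowed positions) before being applied inside scaled dot-product attention.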

[1] Rada Mihalcea et al. MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations. ACL, 2018.

[2] Ramesh Nallapati et al. Who did They Respond to? Conversation Structure Modeling using Masked Hierarchical Transformer. AAAI, 2019.

[3] Ming-Wei Chang et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL, 2019.

[4] Xiaodong Liu et al. Unified Language Model Pre-training for Natural Language Understanding and Generation. NeurIPS, 2019.

[5] Eduard Hovy et al. Emotion Recognition in Conversation: Research Challenges, Datasets, and Recent Advances. IEEE Access, 2019.

[6] Frank Hutter et al. Decoupled Weight Decay Regularization. ICLR, 2017.

[7] Jianhua Tao et al. Conversational Emotion Analysis via Attention Mechanisms. INTERSPEECH, 2019.

[8] Zhe Wang et al. Hierarchical Transformer Network for Utterance-level Emotion Recognition. Applied Sciences, 2020.

[9] Carlos Busso et al. IEMOCAP: Interactive Emotional Dyadic Motion Capture Database. Language Resources and Evaluation, 2008.

[10] Erik Cambria et al. Context-Dependent Sentiment Analysis in User-Generated Videos. ACL, 2017.

[11] Erik Cambria et al. Conversational Memory Network for Emotion Recognition in Dyadic Dialogue Videos. NAACL, 2018.

[12] Jimmy Ba et al. Adam: A Method for Stochastic Optimization. ICLR, 2014.

[13] Michael R. Lyu et al. Real-Time Emotion Recognition via Attention Gated Hierarchical Memory Network. AAAI, 2019.

[14] Max Welling et al. Modeling Relational Data with Graph Convolutional Networks. ESWC, 2017.

[15] Michael R. Lyu et al. HiGRU: Hierarchical Gated Recurrent Units for Utterance-Level Emotion Recognition. NAACL, 2019.

[16] Chunyan Miao et al. Knowledge-Enriched Transformer for Emotion Detection in Textual Conversations. EMNLP, 2019.

[17] Jason Weston et al. End-To-End Memory Networks. NIPS, 2015.

[18] Alexander Gelbukh et al. DialogueGCN: A Graph Convolutional Neural Network for Emotion Recognition in Conversation. EMNLP, 2019.

[19] Alec Radford et al. Improving Language Understanding by Generative Pre-Training. 2018.

[20] Lukasz Kaiser et al. Attention is All you Need. NIPS, 2017.

[21] Jürgen Schmidhuber et al. Long Short-Term Memory. Neural Computation, 1997.

[22] Guodong Zhou et al. Modeling both Context- and Speaker-Sensitive Dependence for Emotion Detection in Multi-speaker Conversations. IJCAI, 2019.

[23] Yunming Ye et al. Enhancing Cross-target Stance Detection with Transferable Semantic-Emotion Knowledge. ACL, 2020.

[24] Rada Mihalcea et al. ICON: Interactive Conversational Memory Network for Multimodal Emotion Detection. EMNLP, 2018.

[25] Rada Mihalcea et al. DialogueRNN: An Attentive RNN for Emotion Detection in Conversations. AAAI, 2018.