A Contextual Attention Network for Multimodal Emotion Recognition in Conversation
Emotion recognition in conversation (ERC) is a challenging task due to the complexity of emotions and the dynamics of dialogue. Most existing studies model each utterance in isolation, neglecting self- and inter-speaker influence. This paper presents a contextual attention neural network built on a multimodal framework that leverages conversational information from both the target speaker and the other speaker for utterance-level emotion detection. Specifically, we use recurrent neural networks with contextual attention to model the interaction and dependence between speakers. Furthermore, feature fusion is employed to combine the salient information extracted from multiple modalities, including audio, text, and video, thereby providing richer and more comprehensive knowledge for emotion recognition. The proposed approach excels at extracting contexts for self- and inter-speaker influence and synthesizing them into global features that benefit the detection of individual emotion states. Experiments on the IEMOCAP corpus report an accuracy of 64.6%, demonstrating the superiority of the proposed method over state-of-the-art approaches.
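To make the described pipeline concrete, the following is a minimal, hypothetical PyTorch sketch of a recurrent encoder with utterance-level contextual attention on top of concatenation-based multimodal fusion. The class name, feature dimensions, and fusion-by-concatenation choice are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ContextualAttentionERC(nn.Module):
    """Sketch: a GRU runs over the dialogue, and each utterance state attends
    over all utterance states to gather self- and inter-speaker context.
    Dimensions and early fusion by concatenation are assumptions."""

    def __init__(self, d_audio=100, d_text=300, d_video=512, d_hidden=128, n_classes=6):
        super().__init__()
        d_fused = d_audio + d_text + d_video           # early fusion by concatenation (assumption)
        self.gru = nn.GRU(d_fused, d_hidden, batch_first=True)
        self.attn = nn.Linear(d_hidden, d_hidden)      # scores context states against each utterance state
        self.classifier = nn.Linear(2 * d_hidden, n_classes)

    def forward(self, audio, text, video):
        # audio/text/video: (batch, seq_len, d_modality) -- one feature vector per utterance
        x = torch.cat([audio, text, video], dim=-1)    # (batch, seq_len, d_fused)
        states, _ = self.gru(x)                        # (batch, seq_len, d_hidden)

        # Contextual attention: every utterance attends over the whole dialogue,
        # pooling the states that carry self- and inter-speaker influence.
        scores = torch.bmm(self.attn(states), states.transpose(1, 2))  # (batch, seq, seq)
        weights = F.softmax(scores, dim=-1)
        context = torch.bmm(weights, states)           # attention-pooled context per utterance

        fused = torch.cat([states, context], dim=-1)   # local state + global context
        return self.classifier(fused)                  # per-utterance emotion logits


# Usage example with random features shaped like an 8-utterance dialogue
if __name__ == "__main__":
    model = ContextualAttentionERC()
    a, t, v = torch.randn(2, 8, 100), torch.randn(2, 8, 300), torch.randn(2, 8, 512)
    print(model(a, t, v).shape)  # torch.Size([2, 8, 6])
```

The sketch only illustrates the two ideas named in the abstract, recurrent context modeling with attention and multimodal feature fusion; speaker-specific recurrences, modality-specific encoders, and the training setup used on IEMOCAP are not reproduced here.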