An Improved Model of Multi-attention LSTM for Multimodal Sentiment Analysis

Multimodal sentiment analysis is the task of detecting emotions in videos using multimodal information such as textual, visual, and audio cues. A common difficulty is the complexity of fusing features from heterogeneous modalities at the feature level. In this paper, we present a novel feature-level fusion method for sentiment analysis called Multi-attention LSTM (MALM). The proposed approach uses an LSTM to capture contextual information among utterances in the same video. In addition, we apply attention mechanisms both before and after multimodal fusion, so that the model focuses on the relatively important utterances and modalities. We evaluate the proposed approach on two multimodal sentiment analysis benchmark datasets and compare it with several existing approaches on the same datasets. The results show an improvement of approximately 5-10% over state-of-the-art models on these benchmarks.
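To make the described architecture concrete, the sketch below illustrates one plausible reading of the design in PyTorch: a bidirectional LSTM per modality models utterance context, an attention layer re-weights each modality's utterance sequence before fusion, the modalities are concatenated at the feature level, and a second attention layer re-weights the fused representation before classification. All class names, feature dimensions, and the specific attention form are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SoftAttention(nn.Module):
    """Simple soft attention that re-weights a sequence of vectors."""

    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x):                            # x: (batch, seq, dim)
        weights = F.softmax(self.score(x), dim=1)    # (batch, seq, 1)
        return x * weights                           # attended sequence, same shape


class MALMSketch(nn.Module):
    """Hypothetical Multi-attention LSTM: per-modality LSTMs with attention
    applied before and after feature-level fusion (assumed structure)."""

    def __init__(self, text_dim=300, audio_dim=74, visual_dim=35,
                 hidden=128, num_classes=2):
        super().__init__()
        # One bidirectional LSTM per modality to model utterance context.
        self.lstms = nn.ModuleDict({
            "text": nn.LSTM(text_dim, hidden, batch_first=True, bidirectional=True),
            "audio": nn.LSTM(audio_dim, hidden, batch_first=True, bidirectional=True),
            "visual": nn.LSTM(visual_dim, hidden, batch_first=True, bidirectional=True),
        })
        # Attention over utterances within each modality (before fusion).
        self.pre_attn = nn.ModuleDict({m: SoftAttention(2 * hidden) for m in self.lstms})
        # Attention over the fused representation (after fusion).
        self.post_attn = SoftAttention(3 * 2 * hidden)
        self.classifier = nn.Linear(3 * 2 * hidden, num_classes)

    def forward(self, text, audio, visual):          # each: (batch, utterances, feat_dim)
        attended = []
        for name, x in (("text", text), ("audio", audio), ("visual", visual)):
            h, _ = self.lstms[name](x)               # contextual utterance features
            attended.append(self.pre_attn[name](h))  # pre-fusion attention
        fused = torch.cat(attended, dim=-1)          # feature-level fusion
        fused = self.post_attn(fused)                # post-fusion attention
        return self.classifier(fused)                # per-utterance sentiment logits


# Toy forward pass: a batch of 4 videos with 20 utterances each.
model = MALMSketch()
logits = model(torch.randn(4, 20, 300), torch.randn(4, 20, 74), torch.randn(4, 20, 35))
print(logits.shape)  # torch.Size([4, 20, 2])
```

The feature dimensions (300 for text, 74 for audio, 35 for visual) are common defaults in this line of work but are only placeholders here; the key structural point is the two attention stages surrounding the concatenation-based fusion.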
