VistaNet: Visual Aspect Attention Network for Multimodal Sentiment Analysis

Detecting the sentiment expressed by a document is a key task for many applications, e.g., modeling user preferences, monitoring consumer behaviors, assessing product quality. Traditionally, the sentiment analysis task primarily relies on textual content. Fueled by the rise of mobile phones that are often the only cameras on hand, documents on the Web (e.g., reviews, blog posts, tweets) are increasingly multimodal in nature, with photos in addition to textual content. A question arises whether the visual component could be useful for sentiment analysis as well. In this work, we propose Visual Aspect Attention Network or VistaNet, leveraging both textual and visual components. We observe that in many cases, with respect to sentiment detection, images play a supporting role to text, highlighting the salient aspects of an entity, rather than expressing sentiments independently of the text. Therefore, instead of using visual information as features, VistaNet relies on visual information as alignment for pointing out the important sentences of a document using attention. Experiments on restaurant reviews showcase the effectiveness of visual aspect attention, vis-a-vis visual features or textual attention.

[1]  Christopher Potts,et al.  Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank , 2013, EMNLP.

[2]  Tao Chen,et al.  DeepSentiBank: Visual Sentiment Concept Classification with Deep Convolutional Neural Networks , 2014, ArXiv.

[3]  Huan Liu,et al.  Unsupervised sentiment analysis with emotional signals , 2013, WWW.

[4]  Erik Cambria,et al.  Tensor Fusion Network for Multimodal Sentiment Analysis , 2017, EMNLP.

[5]  Diyi Yang,et al.  Hierarchical Attention Networks for Document Classification , 2016, NAACL.

[6]  Isabell M. Welpe,et al.  Predicting Elections with Twitter: What 140 Characters Reveal about Political Sentiment , 2010, ICWSM.

[7]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[8]  Ting Liu,et al.  Document Modeling with Gated Recurrent Neural Network for Sentiment Classification , 2015, EMNLP.

[9]  Rongrong Ji,et al.  SentiBank: large-scale ontology and classifiers for detecting sentiment and emotions in visual content , 2013, ACM Multimedia.

[10]  Kiyoaki Shirai,et al.  PhraseRNN: Phrase Recursive Neural Network for Aspect-based Sentiment Analysis , 2015, EMNLP.

[11]  Jeffrey Dean,et al.  Efficient Estimation of Word Representations in Vector Space , 2013, ICLR.

[12]  Jiebo Luo,et al.  Robust Image Sentiment Analysis Using Progressively Trained and Domain Transferred Deep Networks , 2015, AAAI.

[13]  Lei Zhang,et al.  A Survey of Opinion Mining and Sentiment Analysis , 2012, Mining Text Data.

[14]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Yoshua Bengio,et al.  Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation , 2014, EMNLP.

[16]  Ting Liu,et al.  Aspect Level Sentiment Classification with Deep Memory Network , 2016, EMNLP.

[17]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[18]  Steven Bird,et al.  NLTK: The Natural Language Toolkit , 2002, ACL.

[19]  Fei-Fei Li,et al.  ImageNet: A large-scale hierarchical image database , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[20]  Jeffrey Pennington,et al.  GloVe: Global Vectors for Word Representation , 2014, EMNLP.

[21]  Jonathon S. Hare,et al.  Analyzing and predicting sentiment of images on the social web , 2010, ACM Multimedia.

[22]  Jiebo Luo,et al.  Robust Visual-Textual Sentiment Analysis: When Attention meets Tree-structured Recursive Neural Networks , 2016, ACM Multimedia.

[23]  Yoon Kim,et al.  Convolutional Neural Networks for Sentence Classification , 2014, EMNLP.

[24]  Yoshua Bengio,et al.  Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.

[25]  R. Desimone,et al.  Neural mechanisms of selective visual attention. , 1995, Annual review of neuroscience.

[26]  Xiang Zhang,et al.  Character-level Convolutional Networks for Text Classification , 2015, NIPS.

[27]  Jiasen Lu,et al.  Hierarchical Question-Image Co-Attention for Visual Question Answering , 2016, NIPS.

[28]  Quoc-Tuan Truong,et al.  Visual Sentiment Analysis for Review Images with Item-Oriented and User-Oriented CNN , 2017, ACM Multimedia.

[29]  Lillian Lee,et al.  Opinion Mining and Sentiment Analysis , 2008, Found. Trends Inf. Retr..

[30]  Christof Koch,et al.  A Model of Saliency-Based Visual Attention for Rapid Scene Analysis , 2009 .

[31]  Tao Mei,et al.  Multi-level Attention Networks for Visual Question Answering , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[32]  Yoshua Bengio,et al.  Show, Attend and Tell: Neural Image Caption Generation with Visual Attention , 2015, ICML.

[33]  Saif Mohammad,et al.  Sentiment Analysis of Short Informal Texts , 2014, J. Artif. Intell. Res..

[34]  Fei-Fei Li,et al.  Deep visual-semantic alignments for generating image descriptions , 2015, CVPR.

[35]  Jiebo Luo,et al.  Cross-modality Consistent Regression for Joint Visual-Textual Sentiment Analysis of Social Multimedia , 2016, WSDM.

[36]  Johan Bollen,et al.  Modeling Public Mood and Emotion: Twitter Sentiment and Socio-Economic Phenomena , 2009, ICWSM.

[37]  Jun Zhao,et al.  Recurrent Convolutional Neural Networks for Text Classification , 2015, AAAI.