Joint Visual-Textual Sentiment Analysis Based on Cross-Modality Attention Mechanism

Recently, many researchers have focused on joint visual-textual sentiment analysis, since it can better capture user sentiment toward events or topics. In this paper, we propose that visual and textual information should contribute differently to sentiment analysis. Our model learns a robust joint visual-textual representation by combining a cross-modality attention mechanism with semantic embedding learning based on a bidirectional recurrent neural network. Experimental results show that our model outperforms existing state-of-the-art models for sentiment analysis on real-world datasets. In addition, we investigate several variants of the proposed model and analyze the effects of semantic embedding learning and the cross-modality attention mechanism, providing deeper insight into how these two techniques help in learning a joint visual-textual sentiment classifier.
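The abstract names the two main ingredients (a cross-modality attention mechanism and BiRNN-based semantic embedding) without giving architectural details. The sketch below is a minimal, hypothetical illustration of one way such a model could be wired together, assuming pre-extracted CNN region features and GloVe-style word embeddings; the layer names, dimensions, and the specific bidirectional attention/fusion scheme are our own assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalityAttentionClassifier(nn.Module):
    """Hypothetical joint visual-textual sentiment model (illustrative only)."""

    def __init__(self, word_dim=300, img_dim=2048, hidden=256, n_classes=2):
        super().__init__()
        # Semantic embedding learning: a bidirectional RNN over word embeddings.
        self.birnn = nn.LSTM(word_dim, hidden, batch_first=True, bidirectional=True)
        # Project image region features into the textual hidden space.
        self.img_proj = nn.Linear(img_dim, 2 * hidden)
        self.classifier = nn.Linear(4 * hidden, n_classes)

    def forward(self, words, regions):
        # words:   (B, T, word_dim)  word embeddings of the text
        # regions: (B, R, img_dim)   CNN features of image regions
        h, _ = self.birnn(words)                    # (B, T, 2*hidden)
        v = self.img_proj(regions)                  # (B, R, 2*hidden)

        # Cross-modality attention: each word attends over image regions...
        scores_tv = torch.bmm(h, v.transpose(1, 2))              # (B, T, R)
        img_ctx = torch.bmm(F.softmax(scores_tv, dim=-1), v)     # (B, T, 2*hidden)
        txt_repr = (h + img_ctx).mean(dim=1)                     # fused text summary

        # ...and each image region attends over words.
        scores_vt = torch.bmm(v, h.transpose(1, 2))              # (B, R, T)
        txt_ctx = torch.bmm(F.softmax(scores_vt, dim=-1), h)     # (B, R, 2*hidden)
        img_repr = (v + txt_ctx).mean(dim=1)                     # fused image summary

        # Joint visual-textual representation fed to the sentiment classifier.
        joint = torch.cat([txt_repr, img_repr], dim=-1)          # (B, 4*hidden)
        return self.classifier(joint)                            # sentiment logits
```

Because attention is computed in both directions, the two modalities can contribute unequally to the final prediction, which is one plausible reading of the paper's claim that visual and textual information should differ in their contribution.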
