Boosting image sentiment analysis with visual attention

Abstract Sentiment analysis, which aims to determine the attitude of a speaker or writer toward some topic or the overall contextual polarity of a document, plays an important role in the behavioral sciences. The problem is nevertheless nontrivial, especially when inferring sentiment or emotion from visual content such as images and videos, which are becoming pervasive on the Web. Observing that the sentiment of an image may be reflected by only a few spatial regions, a natural question is how to locate these attended regions to enhance image sentiment analysis. In this paper, we present Sentiment Networks with visual Attention (SentiNet-A), a novel architecture that integrates visual attention into the successful Convolutional Neural Network (CNN) sentiment classification framework and is trained in an end-to-end manner. To model visual attention, we develop multiple layers that generate an attention distribution over the regions of the image. Furthermore, the saliency map of the image is employed as prior knowledge and as a regularizer to holistically refine the attention distribution for sentiment prediction. Extensive experiments on the Twitter and ArtPhoto benchmarks show that our framework achieves superior results compared to state-of-the-art techniques.
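
The following is a minimal PyTorch sketch of the idea described above: a CNN backbone yields region features, attention layers produce a spatial attention distribution, and a precomputed saliency map regularizes that distribution during end-to-end training. The backbone choice, layer sizes, and the exact regularization form (here a KL term) are illustrative assumptions, not the paper's specification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models


class SentiNetA(nn.Module):
    """Sketch of a CNN sentiment classifier with spatial visual attention."""

    def __init__(self, num_classes=2, hidden_dim=256):
        super().__init__()
        backbone = models.resnet50(weights=None)
        # Keep only the convolutional layers -> a 7x7 grid of 2048-d region features.
        self.features = nn.Sequential(*list(backbone.children())[:-2])
        # Attention layers: score each spatial region, then softmax over regions.
        self.att = nn.Sequential(
            nn.Conv2d(2048, hidden_dim, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden_dim, 1, kernel_size=1),
        )
        self.classifier = nn.Linear(2048, num_classes)

    def forward(self, images):
        feats = self.features(images)                       # (B, 2048, 7, 7)
        b, c, h, w = feats.shape
        scores = self.att(feats).view(b, h * w)             # (B, 49)
        attn = F.softmax(scores, dim=1)                     # attention distribution
        # Attention-weighted pooling of region features.
        pooled = (feats.view(b, c, h * w) * attn.unsqueeze(1)).sum(dim=2)
        return self.classifier(pooled), attn


def saliency_regularizer(attn, saliency, eps=1e-8):
    """KL divergence pushing the attention distribution toward a normalized
    saliency map resized to the feature-map grid. One plausible way to use
    saliency as prior knowledge; the paper's exact formulation may differ."""
    sal = saliency.flatten(1)
    sal = sal / (sal.sum(dim=1, keepdim=True) + eps)
    return F.kl_div((attn + eps).log(), sal, reduction="batchmean")


# Training objective: cross-entropy plus the saliency regularizer, end to end.
model = SentiNetA()
images = torch.randn(4, 3, 224, 224)
labels = torch.randint(0, 2, (4,))
saliency = torch.rand(4, 7, 7)          # e.g. produced by an off-the-shelf saliency detector
logits, attn = model(images)
loss = F.cross_entropy(logits, labels) + 0.1 * saliency_regularizer(attn, saliency)
loss.backward()
```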
