Robust Visual-Textual Sentiment Analysis: When Attention meets Tree-structured Recursive Neural Networks

Sentiment analysis is crucial for extracting social signals from social media content. Due to huge variation in social media, the performance of sentiment classifiers using single modality (visual or textual) still lags behind satisfaction. In this paper, we propose a new framework that integrates textual and visual information for robust sentiment analysis. Different from previous work, we believe visual and textual information should be treated jointly in a structural fashion. Our system first builds a semantic tree structure based on sentence parsing, aimed at aligning textual words and image regions for accurate analysis. Next, our system learns a robust joint visual-textual semantic representation by incorporating 1) an attention mechanism with LSTM (long short term memory) and 2) an auxiliary semantic learning task. Extensive experimental results on several known data sets show that our method outperforms existing the state-of-the-art joint models in sentiment analysis. We also investigate different tree-structured LSTM (T-LSTM) variants and analyze the effect of the attention mechanism in order to provide deeper insight on how the attention mechanism helps the learning of the joint visual-textual sentiment classifier.

[1]  Lin Ma,et al.  Multimodal Convolutional Neural Networks for Matching Image and Sentence , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[2]  Baoxin Li,et al.  Unsupervised Sentiment Analysis for Social Media Images , 2015, IJCAI.

[3]  Jiebo Luo,et al.  Cross-modality Consistent Regression for Joint Visual-Textual Sentiment Analysis of Social Multimedia , 2016, WSDM.

[4]  Fei-Fei Li,et al.  Deep visual-semantic alignments for generating image descriptions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Xinlei Chen,et al.  Mind's eye: A recurrent visual representation for image caption generation , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Jürgen Schmidhuber,et al.  Learning Precise Timing with LSTM Recurrent Networks , 2003, J. Mach. Learn. Res..

[7]  Andrew Y. Ng,et al.  Parsing Natural Scenes and Natural Language with Recursive Neural Networks , 2011, ICML.

[8]  Rongrong Ji,et al.  A cross-media public sentiment analysis system for microblog , 2014, Multimedia Systems.

[9]  Marc'Aurelio Ranzato,et al.  DeViSE: A Deep Visual-Semantic Embedding Model , 2013, NIPS.

[10]  Geoffrey E. Hinton,et al.  Training Recurrent Neural Networks , 2013 .

[11]  Quoc V. Le,et al.  Grounded Compositional Semantics for Finding and Describing Images with Sentences , 2014, TACL.

[12]  Hongyu Guo,et al.  Long Short-Term Memory Over Recursive Structures , 2015, ICML.

[13]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[14]  Nitish Srivastava,et al.  Multimodal learning with deep Boltzmann machines , 2012, J. Mach. Learn. Res..

[15]  Christopher D. Manning,et al.  Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks , 2015, ACL.

[16]  Navdeep Jaitly,et al.  Hybrid speech recognition with Deep Bidirectional LSTM , 2013, 2013 IEEE Workshop on Automatic Speech Recognition and Understanding.

[17]  Baoxin Li,et al.  Inferring Sentiment from Web Images with Joint Inference on Visual and Social Cues: A Regulated Matrix Factorization Approach , 2021, ICWSM.

[18]  Quoc V. Le,et al.  Distributed Representations of Sentences and Documents , 2014, ICML.

[19]  Jeffrey Dean,et al.  Distributed Representations of Words and Phrases and their Compositionality , 2013, NIPS.

[20]  James Zijun Wang,et al.  RAPID: Rating Pictorial Aesthetics using Deep Learning , 2014, ACM Multimedia.

[21]  Jiebo Luo,et al.  Robust Image Sentiment Analysis Using Progressively Trained and Domain Transferred Deep Networks , 2015, AAAI.

[22]  Sebastian Nowozin,et al.  Structured Learning and Prediction in Computer Vision , 2011, Found. Trends Comput. Graph. Vis..

[23]  F. Gers,et al.  Long short-term memory in recurrent neural networks , 2001 .

[24]  Rongrong Ji,et al.  Microblog Sentiment Analysis Based on Cross-media Bag-of-words Model , 2014, ICIMCS '14.

[25]  Yoshua Bengio,et al.  Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.

[26]  Geoffrey E. Hinton,et al.  Speech recognition with deep recurrent neural networks , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[27]  Eric Gilbert,et al.  VADER: A Parsimonious Rule-Based Model for Sentiment Analysis of Social Media Text , 2014, ICWSM.

[28]  Rongrong Ji,et al.  Large-scale visual sentiment ontology and detectors using adjective noun pairs , 2013, ACM Multimedia.

[29]  Claire Cardie,et al.  39. Opinion mining and sentiment analysis , 2014 .

[30]  Samy Bengio,et al.  Show and tell: A neural image caption generator , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[32]  Jeffrey Pennington,et al.  GloVe: Global Vectors for Word Representation , 2014, EMNLP.

[33]  Jonathon S. Hare,et al.  Analyzing and predicting sentiment of images on the social web , 2010, ACM Multimedia.

[34]  Wei Xu,et al.  Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN) , 2014, ICLR.

[35]  Geoffrey Zweig,et al.  From captions to visual concepts and back , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[36]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[37]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[38]  Jiebo Luo,et al.  Sentribute: image sentiment analysis from a mid-level perspective , 2013, WISDOM '13.

[39]  Jason Weston,et al.  WSABIE: Scaling Up to Large Vocabulary Image Annotation , 2011, IJCAI.

[40]  Rongrong Ji,et al.  SentiBank: large-scale ontology and classifiers for detecting sentiment and emotions in visual content , 2013, ACM Multimedia.

[41]  Razvan Pascanu,et al.  On the difficulty of training recurrent neural networks , 2012, ICML.

[42]  Ruslan Salakhutdinov,et al.  Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models , 2014, ArXiv.

[43]  Mohamed Chtourou,et al.  On the training of recurrent neural networks , 2011, Eighth International Multi-Conference on Systems, Signals & Devices.

[44]  Yi Yang,et al.  Articulated pose estimation with flexible mixtures-of-parts , 2011, CVPR 2011.

[45]  Yoshua Bengio,et al.  Show, Attend and Tell: Neural Image Caption Generation with Visual Attention , 2015, ICML.