A Multi-modal Approach to Fine-grained Opinion Mining on Video Reviews

Despite recent advances in opinion mining for written reviews, few works have tackled the problem for other kinds of reviews. In light of this issue, we propose a multi-modal approach for mining fine-grained opinions from video reviews that determines which aspects of the item under review are being discussed and the sentiment orientation towards them. Our approach works at the sentence level, requires no time annotations, and uses features derived from the audio, the video, and the language transcription of the review's contents. We evaluate our approach on two datasets and show that leveraging the video and audio modalities consistently improves performance over text-only baselines, providing evidence that these extra modalities are key to better understanding video reviews.
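
To make the setup concrete, the sketch below shows one plausible way to fuse sentence-aligned features from the three modalities before a sequence labeler. This is an illustrative assumption, not the paper's architecture: the class name, the feature dimensions (300-d word embeddings, 74-d acoustic descriptors, 2048-d visual features), and the early-fusion-by-concatenation strategy are all hypothetical.

# Minimal sketch (not the authors' implementation): per-sentence text,
# audio, and video features are concatenated, projected, and fed to a
# bidirectional GRU that emits a label per sentence. All dimensions below
# are illustrative assumptions.
import torch
import torch.nn as nn

class MultimodalSentenceEncoder(nn.Module):
    def __init__(self, text_dim=300, audio_dim=74, video_dim=2048,
                 hidden=128, n_labels=5):
        super().__init__()
        # Early fusion: one linear projection over the concatenated features.
        self.proj = nn.Linear(text_dim + audio_dim + video_dim, hidden)
        # BiGRU over the sequence of sentences in a review.
        self.rnn = nn.GRU(hidden, hidden, batch_first=True, bidirectional=True)
        # Per-sentence logits, e.g. aspect/sentiment tags.
        self.out = nn.Linear(2 * hidden, n_labels)

    def forward(self, text, audio, video):
        # Each input: (batch, n_sentences, dim), aligned at the sentence level.
        fused = torch.cat([text, audio, video], dim=-1)
        h, _ = self.rnn(torch.tanh(self.proj(fused)))
        return self.out(h)

# Usage with random sentence-aligned features:
model = MultimodalSentenceEncoder()
logits = model(torch.randn(2, 10, 300),
               torch.randn(2, 10, 74),
               torch.randn(2, 10, 2048))
print(logits.shape)  # torch.Size([2, 10, 5])

Early fusion by concatenation is the simplest baseline; attention-based or tensor-fusion schemes are common alternatives in the multimodal sentiment literature.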
