Multimodal sentiment analysis with word-level fusion and reinforcement learning

With the increasing popularity of video sharing websites such as YouTube and Facebook, multimodal sentiment analysis has received increasing attention from the scientific community. Contrary to previous works in multimodal sentiment analysis which focus on holistic information in speech segments such as bag of words representations and average facial expression intensity, we propose a novel deep architecture for multimodal sentiment analysis that is able to perform modality fusion at the word level. In this paper, we propose the Gated Multimodal Embedding LSTM with Temporal Attention (GME-LSTM(A)) model that is composed of 2 modules. The Gated Multimodal Embedding allows us to alleviate the difficulties of fusion when there are noisy modalities. The LSTM with Temporal Attention can perform word level fusion at a finer fusion resolution between the input modalities and attends to the most important time steps. As a result, the GME-LSTM(A) is able to better model the multimodal structure of speech through time and perform better sentiment comprehension. We demonstrate the effectiveness of this approach on the publicly-available Multimodal Corpus of Sentiment Intensity and Subjectivity Analysis (CMU-MOSI) dataset by achieving state-of-the-art sentiment classification and regression results. Qualitative analysis on our model emphasizes the importance of the Temporal Attention Layer in sentiment prediction because the additional acoustic and visual modalities are noisy. We also demonstrate the effectiveness of the Gated Multimodal Embedding in selectively filtering these noisy modalities out. These results and analysis open new areas in the study of sentiment analysis in human communication and provide new models for multimodal fusion.

[1]  Maite Taboada,et al.  Lexicon-Based Methods for Sentiment Analysis , 2011, CL.

[2]  Erik Cambria,et al.  Recent Trends in Deep Learning Based Natural Language Processing , 2017, IEEE Comput. Intell. Mag..

[3]  Trevor Darrell,et al.  Long-term recurrent convolutional networks for visual recognition and description , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  G. Miller Learning to Forget , 2004, Science.

[5]  Louis-Philippe Morency,et al.  Adolescent Suicidal Risk Assessment in Clinician-Patient Interaction , 2017, IEEE Transactions on Affective Computing.

[6]  Jeffrey Pennington,et al.  GloVe: Global Vectors for Word Representation , 2014, EMNLP.

[7]  Hal Daumé,et al.  Deep Unordered Composition Rivals Syntactic Methods for Text Classification , 2015, ACL.

[8]  Fabien Ringeval,et al.  AVEC 2016: Depression, Mood, and Emotion Recognition Workshop and Challenge , 2016, AVEC@ACM Multimedia.

[9]  Louis-Philippe Morency,et al.  Deep multimodal fusion for persuasiveness prediction , 2016, ICMI.

[10]  Björn W. Schuller,et al.  YouTube Movie Reviews: Sentiment Analysis in an Audio-Visual Context , 2013, IEEE Intelligent Systems.

[11]  Rada Mihalcea,et al.  Towards multimodal sentiment analysis: harvesting opinions from the web , 2011, ICMI '11.

[12]  Phil Blunsom,et al.  A Convolutional Neural Network for Modelling Sentences , 2014, ACL.

[13]  Louis-Philippe Morency,et al.  Computational Analysis of Persuasiveness in Social Multimedia: A Novel Dataset and Multimodal Prediction Approach , 2014, ICMI.

[14]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[15]  Li Zhao,et al.  Attention-based LSTM for Aspect-level Sentiment Classification , 2016, EMNLP.

[16]  Khalil Sima'an,et al.  A Shared Task on Multimodal Machine Translation and Crosslingual Image Description , 2016, WMT.

[17]  Louis-Philippe Morency,et al.  Verbal Behaviors and Persuasiveness in Online Multimedia Content , 2014, SocialNLP@COLING.

[18]  Navneet Kaur,et al.  Opinion mining and sentiment analysis , 2016, 2016 3rd International Conference on Computing for Sustainable Global Development (INDIACom).

[19]  Eric P. Xing,et al.  Select-Additive Learning: Improving Cross-individual Generalization in Multimodal Sentiment Analysis , 2016, ArXiv.

[20]  Louis-Philippe Morency,et al.  Combining Two Perspectives on Classifying Multimodal Data for Recognizing Speaker Traits , 2015, ICMI.

[21]  Erik Cambria,et al.  Deep Convolutional Neural Network Textual Features and Multiple Kernel Learning for Utterance-level Multimodal Sentiment Analysis , 2015, EMNLP.

[22]  Erik Cambria,et al.  A review of affective computing: From unimodal analysis to multimodal fusion , 2017, Inf. Fusion.

[23]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[24]  Eric P. Xing,et al.  Select-additive learning: Improving generalization in multimodal sentiment analysis , 2016, 2017 IEEE International Conference on Multimedia and Expo (ICME).

[25]  Peter Robinson,et al.  OpenFace: An open source facial behavior analysis toolkit , 2016, 2016 IEEE Winter Conference on Applications of Computer Vision (WACV).

[26]  Mei-Chen Yeh,et al.  Fast Human Detection Using a Cascade of Histograms of Oriented Gradients , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[27]  Namita Mittal,et al.  Concept-Level Sentiment Analysis with Dependency-Based Semantic Parsing: A Novel Approach , 2015, Cognitive Computation.

[28]  Quoc V. Le,et al.  Neural Architecture Search with Reinforcement Learning , 2016, ICLR.

[29]  Alexander F. Gelbukh,et al.  Dependency-Based Semantic Parsing for Concept-Level Text Analysis , 2014, CICLing.

[30]  Takeo Kanade,et al.  Facial Expression Analysis , 2011, AMFG.

[31]  Jürgen Schmidhuber,et al.  Learning to Forget: Continual Prediction with LSTM , 2000, Neural Computation.

[32]  Zhou Yu,et al.  Multimodal Prediction of Psychological Disorders: Learning Verbal and Nonverbal Commonalities in Adjacency Pairs , 2013 .

[33]  Ronald J. Williams,et al.  Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning , 2004, Machine Learning.

[34]  Louis-Philippe Morency,et al.  Multimodal Machine Learning: A Survey and Taxonomy , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[35]  Claire Cardie,et al.  Extracting Opinion Expressions with semi-Markov Conditional Random Fields , 2012, EMNLP.

[36]  Erik Cambria,et al.  Sentic LDA: Improving on LDA with semantic similarity for aspect-based sentiment analysis , 2016, 2016 International Joint Conference on Neural Networks (IJCNN).

[37]  Margaret Mitchell,et al.  VQA: Visual Question Answering , 2015, International Journal of Computer Vision.

[38]  Louis-Philippe Morency,et al.  MOSI: Multimodal Corpus of Sentiment Intensity and Subjectivity Analysis in Online Opinion Videos , 2016, ArXiv.

[39]  Dipankar Das,et al.  Fuzzy Clustering for Semi-supervised Learning - Case Study: Construction of an Emotion Lexicon , 2012, MICAI.

[40]  Verónica Pérez-Rosas,et al.  Utterance-Level Multimodal Sentiment Analysis , 2013, ACL.

[41]  Fernando De la Torre,et al.  Facial Expression Analysis , 2011, Visual Analysis of Humans.

[42]  Erik Cambria,et al.  Ensemble application of convolutional neural networks and multiple kernel learning for multimodal sentiment analysis , 2017, Neurocomputing.

[43]  John Kane,et al.  COVAREP — A collaborative voice analysis repository for speech technologies , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[44]  Albert A. Rizzo,et al.  Self-Reported Symptoms of Depression and PTSD Are Associated with Reduced Vowel Space in Screening Interviews , 2016, IEEE Transactions on Affective Computing.

[45]  Erik Cambria,et al.  Deep Learning-Based Document Modeling for Personality Detection from Text , 2017, IEEE Intelligent Systems.

[46]  Louis-Philippe Morency,et al.  Convolutional Experts Constrained Local Model for Facial Landmark Detection , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[47]  Jiebo Luo,et al.  Image Captioning with Semantic Attention , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[48]  Christopher Potts,et al.  Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank , 2013, EMNLP.

[49]  Louis-Philippe Morency,et al.  Multimodal Sentiment Intensity Analysis in Videos: Facial Gestures and Verbal Messages , 2016, IEEE Intelligent Systems.