Select-Additive Learning: Improving Cross-individual Generalization in Multimodal Sentiment Analysis

Multimodal sentiment analysis is attracting increasing attention. It enables the mining of opinions in video reviews and surveys, which are now plentiful on online platforms such as YouTube. However, the limited number of high-quality multimodal sentiment data samples may cause models to learn sentiment cues that depend on individual-specific features in the dataset, which limits the generalizability of the trained classifiers to larger online platforms. In this paper, we first examine the data and verify the existence of this dependence problem. We then propose Select-Additive Learning (SAL), a procedure that improves the generalizability of trained discriminative neural networks. SAL is a two-phase learning method. In the Selection phase, it identifies the confounding components of the learned representation. In the Addition phase, it forces the classifier to discard these confounded components by adding Gaussian noise to them. In our experiments, we show how SAL improves the generalizability of state-of-the-art models, significantly increasing prediction accuracy in all three modalities (text, audio, video) as well as in their fusion. We also show that SAL, even when trained on one dataset, achieves good accuracy across test datasets.
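
Below is a minimal PyTorch sketch of the two-phase idea described above, assuming a pre-trained encoder h(x), a linear selector g fed one-hot speaker identities, and a noise scale sigma. All names, dimensions, and loss weights here are illustrative assumptions for exposition, not the authors' implementation.

```python
import torch
import torch.nn as nn

# Illustrative dimensions (assumptions, not taken from the paper).
FEAT_DIM, HID_DIM, N_PEOPLE, N_CLASSES = 128, 64, 30, 2

encoder = nn.Sequential(nn.Linear(FEAT_DIM, HID_DIM), nn.Tanh())  # h(x): pre-trained representation
classifier = nn.Linear(HID_DIM, N_CLASSES)                        # sentiment head
selector = nn.Linear(N_PEOPLE, HID_DIM)                           # g(z): one-hot identity -> confound estimate

def selection_phase_loss(x, person_onehot, l1_weight=1e-3):
    """Phase 1: train g to reconstruct the identity-predictable part of h(x).
    An L1 penalty keeps g sparse, so it captures only genuinely confounded dimensions."""
    with torch.no_grad():
        h = encoder(x)                      # the pre-trained encoder stays frozen here
    g = selector(person_onehot)
    recon = ((h - g) ** 2).mean()
    sparsity = selector.weight.abs().mean()
    return recon + l1_weight * sparsity

def addition_phase_loss(x, person_onehot, y, sigma=1.0):
    """Phase 2: corrupt the selected confounded component with Gaussian noise and
    retrain the classifier, so it learns to ignore identity-specific features."""
    h = encoder(x)
    with torch.no_grad():
        g = selector(person_onehot)         # the selector stays frozen here
    noise = torch.randn_like(g)
    h_tilde = h + sigma * noise * g         # noise scaled by the selected confound
    logits = classifier(h_tilde)
    return nn.functional.cross_entropy(logits, y)
```

Scaling the noise by the selector's output, rather than corrupting the whole representation, means only the identity-dependent dimensions become unreliable during retraining, which pushes the classifier toward the sentiment-relevant part of the representation.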
