Getting the subtext without the text: Scalable multimodal sentiment classification from visual and acoustic modalities

In the last decade, video blogs (vlogs) have become an extremely popular medium through which people express sentiment. The ubiquity of these videos has increased the importance of multimodal fusion models, which incorporate video and audio features with traditional text features for automatic sentiment detection. Multimodal fusion offers a unique opportunity to build models that learn from the full depth of expression available to human viewers. When detecting sentiment in these videos, acoustic and visual features help disambiguate otherwise ambiguous transcripts. In this paper, we present a multimodal fusion model that uses only high-level video and audio features to classify the sentiment of spoken sentences. We discard traditional transcription features in order to minimize human intervention and to maximize the deployability of our model on real-world data at scale. We select high-level features that have been successful in non-affective domains in order to test their generalizability to sentiment detection. We train and test our model on the newly released CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI) dataset, obtaining an F1 score of 0.8049 on the validation set and an F1 score of 0.6325 on the held-out challenge test set.
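The abstract does not describe the fusion architecture itself; as a minimal sketch of how exclusively acoustic and visual features might be fused for utterance-level sentiment classification, the following early-fusion classifier is one possibility. All class names, layer sizes, and feature dimensions here are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (assumed architecture): early fusion of precomputed
# per-utterance acoustic and visual feature vectors, followed by an MLP.
# Feature dimensions and layer sizes are hypothetical.
import torch
import torch.nn as nn

class AudioVisualSentimentClassifier(nn.Module):
    def __init__(self, acoustic_dim=74, visual_dim=35, hidden_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(acoustic_dim + visual_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden_dim, 1),  # logit for positive vs. negative sentiment
        )

    def forward(self, acoustic_feats, visual_feats):
        # Concatenate the two modalities and classify the fused vector.
        fused = torch.cat([acoustic_feats, visual_feats], dim=-1)
        return self.net(fused)

# Usage with random stand-in features for a batch of 8 utterances.
model = AudioVisualSentimentClassifier()
acoustic = torch.randn(8, 74)   # e.g., pooled prosodic/spectral statistics
visual = torch.randn(8, 35)     # e.g., pooled facial expression features
probs = torch.sigmoid(model(acoustic, visual))  # P(positive sentiment)
```

In practice, the per-utterance vectors would be pooled from frame-level feature extractors; the binary positive/negative framing above is an assumption consistent with the F1 scores reported in the abstract.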
