Computational Modeling of Human Multimodal Language: The MOSEI Dataset and Interpretable Dynamic Fusion

Computational modeling of human multimodal language is an emerging research area in natural language processing spanning the language, visual, and acoustic modalities. Comprehending multimodal language requires modeling not only the interactions within each modality (intramodal interactions) but also, more importantly, the interactions between modalities (crossmodal interactions). Modeling these interactions lies at the core of multimodal language analysis. From a resource perspective, there is a genuine need for large-scale datasets that allow for in-depth studies of human multimodal language. In this paper we introduce Multimodal Opinion Sentiment and Emotion Intensity (MOSEI), the largest dataset for multimodal sentiment analysis and emotion recognition. In addition, we propose a novel multimodal fusion technique, the Graph Memory Fusion Network (GMFN), which dynamically fuses modalities in a hierarchical manner. Using MOSEI and GMFN, we conduct experiments investigating the hierarchical interactions between modalities in human multimodal language. Unlike previously proposed fusion techniques, GMFN is highly interpretable, and it outperforms the previous state of the art, demonstrating that it is well suited to multimodal language analysis.
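To make the idea of hierarchical, dynamically gated fusion concrete, below is a minimal sketch in PyTorch. It illustrates the general technique rather than the authors' exact architecture: the class name DynamicFusionSketch, the vertex dimensions, and the sigmoid "efficacy" gates are assumptions. The sketch builds bimodal vertices from the language (l), visual (v), and acoustic (a) representations, then a trimodal vertex from the bimodal ones; each fusion step is scaled by a learned gate in [0, 1], so the network can strengthen or prune crossmodal connections per input.

```python
import torch
import torch.nn as nn


class DynamicFusionSketch(nn.Module):
    """Hypothetical sketch of hierarchical, dynamically gated trimodal fusion.

    Unimodal vertices (l, v, a) feed bimodal vertices (lv, la, va), which in
    turn feed a trimodal vertex (lva). A learned "efficacy" gate in [0, 1]
    scales each fused vertex, acting as a soft, input-dependent switch on
    that part of the fusion hierarchy.
    """

    def __init__(self, dim: int = 32):
        super().__init__()

        def fuse(n_in: int) -> nn.Linear:
            # Maps the concatenation of n_in parent vertices back to `dim`.
            return nn.Linear(dim * n_in, dim)

        def gate(n_in: int) -> nn.Sequential:
            # Scalar efficacy in [0, 1], conditioned on the same inputs.
            return nn.Sequential(nn.Linear(dim * n_in, 1), nn.Sigmoid())

        self.fuse_lv, self.gate_lv = fuse(2), gate(2)
        self.fuse_la, self.gate_la = fuse(2), gate(2)
        self.fuse_va, self.gate_va = fuse(2), gate(2)
        self.fuse_lva, self.gate_lva = fuse(3), gate(3)

    def _vertex(self, fuse: nn.Module, gate: nn.Module,
                *parents: torch.Tensor) -> torch.Tensor:
        x = torch.cat(parents, dim=-1)
        # A near-zero efficacy effectively prunes this vertex for this input.
        return gate(x) * torch.tanh(fuse(x))

    def forward(self, l: torch.Tensor, v: torch.Tensor,
                a: torch.Tensor) -> torch.Tensor:
        lv = self._vertex(self.fuse_lv, self.gate_lv, l, v)
        la = self._vertex(self.fuse_la, self.gate_la, l, a)
        va = self._vertex(self.fuse_va, self.gate_va, v, a)
        # Hierarchical step: the trimodal vertex fuses the bimodal vertices.
        return self._vertex(self.fuse_lva, self.gate_lva, lv, la, va)


if __name__ == "__main__":
    model = DynamicFusionSketch(dim=32)
    l, v, a = (torch.randn(4, 32) for _ in range(3))
    print(model(l, v, a).shape)  # torch.Size([4, 32])
```

In a sketch like this, inspecting the per-example gate activations is one way such a fusion can be made interpretable: each efficacy indicates how strongly a given crossmodal combination contributed to the final representation.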
