What makes the difference? An empirical comparison of fusion strategies for multimodal language analysis

Abstract Multimodal video sentiment analysis is a rapidly growing area that combines the verbal (i.e., linguistic) modality with non-verbal modalities (i.e., visual and acoustic) to predict the sentiment of utterances. A recent trend has been to develop modality fusion models built on various attention, memory, and recurrent components. However, a systematic investigation of how these components contribute to solving the problem, and of their limitations, is still lacking. This paper aims to fill that gap and makes the following key contributions. We present the first large-scale, comprehensive empirical comparison of eleven state-of-the-art (SOTA) modality fusion approaches on two video sentiment analysis tasks, using three SOTA benchmark corpora. An in-depth analysis of the results shows that, first, attention mechanisms are the most effective for modelling crossmodal interactions, yet they are computationally expensive. Second, additional levels of crossmodal interaction decrease performance. Third, positive-sentiment utterances are the most challenging cases for all approaches. Finally, integrating context and using the linguistic modality as a pivot for the non-verbal modalities improves performance. We expect these findings to provide helpful insights and guidance for the development of more effective modality fusion models.
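
To make the attention-based crossmodal fusion described above concrete, below is a minimal PyTorch sketch (an illustrative assumption, not the paper's or any compared model's implementation) in which the linguistic modality acts as the pivot: text features query the acoustic and visual streams before the pooled representations are fused for sentiment prediction. The class name, feature dimensions (300-d text, 74-d acoustic, 35-d visual, typical of GloVe/COVAREP/facial-feature pipelines), and hyperparameters are all hypothetical.

```python
# Minimal sketch of crossmodal attention fusion with the linguistic modality
# as the pivot. Names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn


class CrossmodalAttentionFusion(nn.Module):
    def __init__(self, d_text=300, d_audio=74, d_visual=35, d_model=128, n_heads=4):
        super().__init__()
        # Project each modality into a shared model dimension.
        self.proj_t = nn.Linear(d_text, d_model)
        self.proj_a = nn.Linear(d_audio, d_model)
        self.proj_v = nn.Linear(d_visual, d_model)
        # Text attends to the acoustic and visual streams (crossmodal attention).
        self.attn_ta = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.attn_tv = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Regression head for utterance-level sentiment intensity.
        self.head = nn.Linear(3 * d_model, 1)

    def forward(self, text, audio, visual):
        # text: (B, Lt, d_text), audio: (B, La, d_audio), visual: (B, Lv, d_visual)
        t, a, v = self.proj_t(text), self.proj_a(audio), self.proj_v(visual)
        t_from_a, _ = self.attn_ta(query=t, key=a, value=a)
        t_from_v, _ = self.attn_tv(query=t, key=v, value=v)
        # Pool over time and fuse the text-pivoted representations.
        fused = torch.cat([t.mean(1), t_from_a.mean(1), t_from_v.mean(1)], dim=-1)
        return self.head(fused)


# Example: a batch of two utterances with unaligned sequence lengths per modality.
model = CrossmodalAttentionFusion()
score = model(torch.randn(2, 20, 300), torch.randn(2, 50, 74), torch.randn(2, 40, 35))
print(score.shape)  # torch.Size([2, 1])
```

Because the text stream issues the queries, the non-verbal modalities only contribute through their interaction with language, which is one way to realize the "linguistic modality as pivot" finding; other fusion families compared in the paper (tensor, memory, recurrent multistage) combine modalities differently.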
