M3: MultiModal Masking Applied to Sentiment Analysis

A common issue when training multimodal architectures is that not all modalities contribute equally to the model's prediction, and the network tends to over-rely on the strongest modality. In this work, we present M3, a training procedure based on modality masking for deep multimodal architectures. During training, we randomly select one modality and mask its features, forcing the model to make its prediction in the absence of that modality. This structured regularization allows the network to better exploit complementary information in the input modalities. We implement M3 as a generic layer that can be integrated with any multimodal architecture. Our experiments show that M3 outperforms other masking schemes and improves the performance of our strong baseline. We evaluate M3 for multimodal sentiment analysis on CMU-MOSEI, achieving results comparable to the state of the art.
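A minimal sketch of the masking idea described above, assuming a PyTorch setup. The class name M3Masking, the masking probability p, and the choice to zero out the selected modality's features are illustrative assumptions, not the authors' exact implementation; the layer is only meant to show how per-batch modality masking can be dropped in before fusion.

```python
# Illustrative sketch (assumed implementation): randomly mask one modality
# during training, pass all modalities through unchanged at evaluation time.
import random
import torch
import torch.nn as nn


class M3Masking(nn.Module):
    """Randomly masks the features of one modality during training."""

    def __init__(self, p: float = 0.5):
        super().__init__()
        self.p = p  # probability of applying masking on a given forward pass (assumed)

    def forward(self, *modalities: torch.Tensor):
        # Only mask while training; leave inputs untouched during evaluation.
        if self.training and random.random() < self.p:
            idx = random.randrange(len(modalities))  # pick one modality uniformly
            modalities = tuple(
                torch.zeros_like(m) if i == idx else m
                for i, m in enumerate(modalities)
            )
        return modalities


# Example usage: insert the layer before the fusion step of any multimodal model.
#   text, audio, video = M3Masking(p=0.5)(text, audio, video)
```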
