MMLatch: Bottom-Up Top-Down Fusion for Multimodal Sentiment Analysis

Current deep learning approaches to multimodal fusion rely on bottom-up fusion of high- and mid-level latent modality representations (late/mid fusion) or of low-level sensory inputs (early fusion). Models of human perception highlight the importance of top-down fusion, where high-level representations affect the way sensory inputs are perceived, i.e., cognition affects perception. These top-down interactions are not captured in current deep learning models. In this work, we propose a neural architecture that captures top-down cross-modal interactions, using a feedback mechanism in the forward pass during network training. The proposed mechanism extracts high-level representations for each modality and uses these representations to mask the sensory inputs, allowing the model to perform top-down feature masking. We apply the proposed model to multimodal sentiment recognition on CMU-MOSEI. Our method shows consistent improvements over the well-established MulT and over our strong late-fusion baseline, achieving state-of-the-art results.
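
As a rough illustration of the mechanism described above, the following is a minimal PyTorch sketch, not the authors' implementation: a first bottom-up pass extracts a high-level summary per modality, a feedback module turns the other modalities' summaries into a sigmoid mask over each modality's low-level input features, and a second pass re-encodes the masked inputs and fuses them with a simple late-fusion head. The module names, the choice of LSTM encoders, the sigmoid gating, the assumption of exactly three modalities, and the concatenation readout are illustrative assumptions.

import torch
import torch.nn as nn


class FeedbackMask(nn.Module):
    """Builds a sigmoid mask for one modality's input features from the
    high-level summaries of the other two modalities (illustrative)."""

    def __init__(self, high_dim: int, feat_dim: int):
        super().__init__()
        self.proj = nn.Linear(2 * high_dim, feat_dim)

    def forward(self, high_a: torch.Tensor, high_b: torch.Tensor) -> torch.Tensor:
        # high_a, high_b: (batch, high_dim) summaries of the other modalities
        gate = torch.sigmoid(self.proj(torch.cat([high_a, high_b], dim=-1)))
        return gate.unsqueeze(1)  # broadcast the mask over the time dimension


class TopDownFusion(nn.Module):
    """Two-pass encoder: pass 1 extracts high-level per-modality summaries,
    which then mask the raw inputs before the second (bottom-up) pass."""

    def __init__(self, dims: dict, hidden: int = 128, num_classes: int = 1):
        super().__init__()
        self.encoders = nn.ModuleDict(
            {m: nn.LSTM(d, hidden, batch_first=True) for m, d in dims.items()}
        )
        self.masks = nn.ModuleDict(
            {m: FeedbackMask(hidden, d) for m, d in dims.items()}
        )
        self.head = nn.Linear(hidden * len(dims), num_classes)

    def encode(self, inputs: dict) -> dict:
        # Use the last LSTM hidden state as the high-level summary of each modality.
        return {m: self.encoders[m](x)[1][0][-1] for m, x in inputs.items()}

    def forward(self, inputs: dict) -> torch.Tensor:
        # Pass 1: bottom-up encoding of the raw inputs.
        high = self.encode(inputs)
        # Top-down feedback: mask each modality's inputs with the others' summaries.
        mods = list(inputs)
        masked = {}
        for m in mods:
            others = [high[o] for o in mods if o != m]
            masked[m] = inputs[m] * self.masks[m](*others)
        # Pass 2: re-encode the masked inputs and fuse by concatenation (late fusion).
        high2 = self.encode(masked)
        return self.head(torch.cat([high2[m] for m in mods], dim=-1))


if __name__ == "__main__":
    # Hypothetical feature sizes for text/audio/visual streams of length 50.
    dims = {"text": 300, "audio": 74, "visual": 35}
    model = TopDownFusion(dims)
    batch = {m: torch.randn(4, 50, d) for m, d in dims.items()}
    print(model(batch).shape)  # torch.Size([4, 1])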
