MMLatch: Bottom-Up Top-Down Fusion for Multimodal Sentiment Analysis

Current deep learning approaches to multimodal fusion rely on bottom-up fusion of high- and mid-level latent modality representations (late/mid fusion) or of low-level sensory inputs (early fusion). Models of human perception highlight the importance of top-down fusion, where high-level representations affect the way sensory inputs are perceived, i.e., cognition affects perception. These top-down interactions are not captured in current deep learning models. In this work, we propose a neural architecture that captures top-down cross-modal interactions, using a feedback mechanism in the forward pass during network training. The proposed mechanism extracts high-level representations for each modality and uses these representations to mask the sensory inputs, allowing the model to perform top-down feature masking. We apply the proposed model to multimodal sentiment recognition on CMU-MOSEI. Our method shows consistent improvements over the well-established MulT and over our strong late-fusion baseline, achieving state-of-the-art results.
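
As a rough illustration of the mechanism described above, the following is a minimal PyTorch sketch, not the authors' implementation: a first bottom-up pass extracts a high-level summary per modality, a feedback module turns the other modalities' summaries into a sigmoid mask over each modality's low-level input features, and a second pass re-encodes the masked inputs and fuses them with a simple late-fusion head. The module names, the choice of LSTM encoders, the sigmoid gating, the assumption of exactly three modalities, and the concatenation readout are illustrative assumptions.

import torch
import torch.nn as nn


class FeedbackMask(nn.Module):
    """Builds a sigmoid mask for one modality's input features from the
    high-level summaries of the other two modalities (illustrative)."""

    def __init__(self, high_dim: int, feat_dim: int):
        super().__init__()
        self.proj = nn.Linear(2 * high_dim, feat_dim)

    def forward(self, high_a: torch.Tensor, high_b: torch.Tensor) -> torch.Tensor:
        # high_a, high_b: (batch, high_dim) summaries of the other modalities
        gate = torch.sigmoid(self.proj(torch.cat([high_a, high_b], dim=-1)))
        return gate.unsqueeze(1)  # broadcast the mask over the time dimension


class TopDownFusion(nn.Module):
    """Two-pass encoder: pass 1 extracts high-level per-modality summaries,
    which then mask the raw inputs before the second (bottom-up) pass."""

    def __init__(self, dims: dict, hidden: int = 128, num_classes: int = 1):
        super().__init__()
        self.encoders = nn.ModuleDict(
            {m: nn.LSTM(d, hidden, batch_first=True) for m, d in dims.items()}
        )
        self.masks = nn.ModuleDict(
            {m: FeedbackMask(hidden, d) for m, d in dims.items()}
        )
        self.head = nn.Linear(hidden * len(dims), num_classes)

    def encode(self, inputs: dict) -> dict:
        # Use the last LSTM hidden state as the high-level summary of each modality.
        return {m: self.encoders[m](x)[1][0][-1] for m, x in inputs.items()}

    def forward(self, inputs: dict) -> torch.Tensor:
        # Pass 1: bottom-up encoding of the raw inputs.
        high = self.encode(inputs)
        # Top-down feedback: mask each modality's inputs with the others' summaries.
        mods = list(inputs)
        masked = {}
        for m in mods:
            others = [high[o] for o in mods if o != m]
            masked[m] = inputs[m] * self.masks[m](*others)
        # Pass 2: re-encode the masked inputs and fuse by concatenation (late fusion).
        high2 = self.encode(masked)
        return self.head(torch.cat([high2[m] for m in mods], dim=-1))


if __name__ == "__main__":
    # Hypothetical feature sizes for text/audio/visual streams of length 50.
    dims = {"text": 300, "audio": 74, "visual": 35}
    model = TopDownFusion(dims)
    batch = {m: torch.randn(4, 50, d) for m, d in dims.items()}
    print(model(batch).shape)  # torch.Size([4, 1])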
