Context-aware Interactive Attention for Multi-modal Sentiment and Emotion Analysis

In recent times, multi-modal analysis has emerged as a highly sought-after field at the intersection of natural language processing, computer vision, and speech processing. The prime objective of such studies is to leverage diversified information (e.g., textual, acoustic, and visual) for learning a model. Effective interaction among these modalities often leads to a better-performing system. In this paper, we introduce a recurrent neural network based approach for multi-modal sentiment and emotion analysis. The proposed model learns the inter-modal interaction among the participating modalities through an auto-encoder mechanism, and employs a context-aware attention module to exploit the correspondence among neighboring utterances. We evaluate the proposed approach on five standard multi-modal affect analysis datasets. Experimental results suggest the efficacy of the proposed model for both sentiment and emotion analysis in comparison with various existing state-of-the-art systems.
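The sketch below is a minimal, illustrative interpretation of the two ideas the abstract describes: an auto-encoder style bottleneck that fuses the textual, acoustic, and visual features, and a context-aware attention module applied over the utterances of a video encoded by a recurrent network. It is not the authors' released implementation; all layer names, sizes, and the specific attention formulation (e.g., ModalityAutoEncoder, ContextAttention) are assumptions made for illustration.

```python
# Minimal sketch (assumed architecture, not the paper's official code) of
# auto-encoder based modality fusion + context-aware attention over utterances.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModalityAutoEncoder(nn.Module):
    """Compress concatenated text/audio/visual features into a shared code."""
    def __init__(self, in_dim, code_dim):
        super().__init__()
        self.encoder = nn.Linear(in_dim, code_dim)
        self.decoder = nn.Linear(code_dim, in_dim)

    def forward(self, x):
        code = torch.tanh(self.encoder(x))
        recon = self.decoder(code)          # reconstruction drives an auxiliary loss
        return code, recon


class ContextAttention(nn.Module):
    """Soft attention over all utterances of a video, letting each utterance
    attend to its neighbours (the 'context')."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, h):                   # h: (batch, n_utterances, dim)
        scores = torch.bmm(self.proj(h), h.transpose(1, 2))   # pairwise scores
        attn = F.softmax(scores, dim=-1)
        context = torch.bmm(attn, h)        # context-aware representations
        return context + h                  # residual connection


class MultimodalAffectModel(nn.Module):
    def __init__(self, text_dim, audio_dim, visual_dim, hidden=128, n_classes=2):
        super().__init__()
        fused_in = text_dim + audio_dim + visual_dim
        self.autoencoder = ModalityAutoEncoder(fused_in, hidden)
        self.gru = nn.GRU(hidden, hidden, batch_first=True, bidirectional=True)
        self.attention = ContextAttention(2 * hidden)
        self.classifier = nn.Linear(2 * hidden, n_classes)

    def forward(self, text, audio, visual):
        # each input: (batch, n_utterances, modality_dim)
        fused = torch.cat([text, audio, visual], dim=-1)
        code, recon = self.autoencoder(fused)
        h, _ = self.gru(code)               # recurrent encoding of the utterance sequence
        h = self.attention(h)               # context-aware attention
        logits = self.classifier(h)         # per-utterance sentiment/emotion logits
        return logits, recon, fused


if __name__ == "__main__":
    # Hypothetical feature sizes, roughly in the range of common MOSI/MOSEI features.
    model = MultimodalAffectModel(text_dim=300, audio_dim=74, visual_dim=35)
    t = torch.randn(4, 10, 300)             # 4 videos, 10 utterances each
    a = torch.randn(4, 10, 74)
    v = torch.randn(4, 10, 35)
    logits, recon, fused = model(t, a, v)
    labels = torch.zeros(40, dtype=torch.long)
    loss = F.cross_entropy(logits.view(-1, 2), labels) + F.mse_loss(recon, fused)
    print(logits.shape)                      # torch.Size([4, 10, 2])
```

In this reading, the classification loss is combined with the auto-encoder's reconstruction loss so that the shared code retains information from all three modalities while the attention layer supplies inter-utterance context; the exact loss weighting and attention scoring are design choices assumed here, not taken from the paper.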
