A Deep Multi-task Contextual Attention Framework for Multi-modal Affect Analysis

Multi-modal affect analysis (e.g., sentiment and emotion analysis) is an interdisciplinary study that has emerged as a prominent field in Natural Language Processing and Computer Vision. The effective fusion of multiple modalities (e.g., text, acoustic, or visual frames) is a non-trivial task, as these modalities often carry distinct and diverse information and do not contribute equally. The issue is further compounded when the data contain noise. In this article, we study the concept of multi-task learning for multi-modal affect analysis and explore a contextual inter-modal attention framework that aims to leverage the association among the neighboring utterances and their multi-modal information. In general, sentiments and emotions are inter-dependent (e.g., anger → negative or happy → positive). In our current work, we exploit the relatedness among the participating tasks in the multi-task framework. We define three different multi-task setups, each having two tasks: sentiment & emotion classification, sentiment classification & sentiment intensity prediction, and emotion classification & emotion intensity prediction. Our evaluation of the proposed system on the CMU Multi-modal Opinion Sentiment and Emotion Intensity (CMU-MOSEI) benchmark dataset suggests that, in comparison with the single-task learning framework, our multi-task framework yields better performance for the inter-related participating tasks. Further, comparative studies show that our proposed approach attains state-of-the-art performance in most cases.
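To make the described architecture concrete, below is a minimal PyTorch-style sketch of contextual inter-modal attention combined with two task-specific heads (one multi-task setup: sentiment & emotion classification). The layer names, feature dimensions (e.g., 300-d text, 74-d acoustic, 35-d visual), the element-wise gating, and the pairwise fusion are illustrative assumptions for exposition, not the authors' exact implementation.

```python
# Minimal sketch: contextual inter-modal attention over utterance sequences,
# shared across two tasks (sentiment & emotion). Assumed shapes/dimensions
# are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ContextualInterModalAttention(nn.Module):
    """Cross-attends two modalities along the utterance (context) axis."""

    def forward(self, x, y):
        # x, y: (batch, num_utterances, dim)
        a_xy = F.softmax(torch.matmul(x, y.transpose(1, 2)), dim=-1)  # (B, U, U)
        a_yx = F.softmax(torch.matmul(y, x.transpose(1, 2)), dim=-1)  # (B, U, U)
        o_x = torch.matmul(a_xy, y)   # context from y, aligned to x's utterances
        o_y = torch.matmul(a_yx, x)   # context from x, aligned to y's utterances
        # Element-wise gating with the original representations (assumed choice).
        return torch.cat([o_x * x, o_y * y], dim=-1)  # (B, U, 2*dim)


class MultiTaskAffectModel(nn.Module):
    """Shared contextual encoders + pairwise inter-modal attention, two heads."""

    def __init__(self, text_dim, audio_dim, visual_dim, hidden=100,
                 n_sentiments=2, n_emotions=6):
        super().__init__()
        self.text_rnn = nn.GRU(text_dim, hidden, batch_first=True, bidirectional=True)
        self.audio_rnn = nn.GRU(audio_dim, hidden, batch_first=True, bidirectional=True)
        self.visual_rnn = nn.GRU(visual_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = ContextualInterModalAttention()
        fused_dim = 3 * 2 * (2 * hidden)  # three modality pairs, each 2*(2*hidden)
        self.sentiment_head = nn.Linear(fused_dim, n_sentiments)
        self.emotion_head = nn.Linear(fused_dim, n_emotions)

    def forward(self, text, audio, visual):
        t, _ = self.text_rnn(text)     # (B, U, 2*hidden)
        a, _ = self.audio_rnn(audio)
        v, _ = self.visual_rnn(visual)
        fused = torch.cat([self.attn(t, a), self.attn(t, v), self.attn(a, v)], dim=-1)
        return self.sentiment_head(fused), self.emotion_head(fused)


# Usage: both heads are trained jointly with a combined loss (multi-task setup).
model = MultiTaskAffectModel(text_dim=300, audio_dim=74, visual_dim=35)
text = torch.randn(4, 20, 300)    # 4 videos, 20 utterances each (dummy features)
audio = torch.randn(4, 20, 74)
visual = torch.randn(4, 20, 35)
sent_logits, emo_logits = model(text, audio, visual)
print(sent_logits.shape, emo_logits.shape)  # (4, 20, 2) (4, 20, 6)
```

The sketch keeps the utterance axis throughout so that each utterance's prediction can draw on its neighbors' multi-modal context; the choice of a joint loss over both heads reflects the multi-task setups described above.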
