Multimodal Multitask Emotion Recognition using Images, Texts and Tags

Multimodal emotion recognition has recently received increasing interest due to its potential to improve performance by leveraging complementary sources of information. In this work, we explore the use of images, texts, and tags for emotion recognition. However, using several modalities also brings an additional challenge that is often ignored, namely the problem of "missing modality". Social media users do not always publish content containing an image, text, and tags, so one or two modalities are often missing at test time. Similarly, labeled training data containing all modalities can be limited. Taking this into consideration, we propose a multimodal model that leverages a multitask framework to enable training on data composed of an arbitrary number of modalities, while also allowing predictions with missing modalities. We show that our approach is robust to one or two missing modalities at test time. Moreover, with this framework it becomes easy to fine-tune parts of our model with unimodal and bimodal training data, which can further improve overall performance. Finally, our experiments indicate that this multitask learning also acts as a regularization mechanism that improves generalization.

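To make the idea concrete, the sketch below shows one way such a multitask architecture could be organized: one encoder per modality and one prediction head per modality subset, so that unimodal, bimodal, and trimodal samples can all be used for training and any modality may be absent at test time. This is an illustrative assumption, not the authors' implementation; all module names, layer sizes, and the choice of pre-extracted features are hypothetical.

```python
# Minimal PyTorch sketch of a multitask multimodal classifier that tolerates
# missing modalities. NOT the paper's implementation; dimensions and names
# are illustrative assumptions.
import torch
import torch.nn as nn


class MultitaskEmotionModel(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=300, tag_dim=300,
                 hidden=256, num_emotions=8):
        super().__init__()
        # One encoder per modality (features assumed to be pre-extracted).
        self.encoders = nn.ModuleDict({
            "image": nn.Sequential(nn.Linear(img_dim, hidden), nn.ReLU()),
            "text":  nn.Sequential(nn.Linear(txt_dim, hidden), nn.ReLU()),
            "tags":  nn.Sequential(nn.Linear(tag_dim, hidden), nn.ReLU()),
        })
        # One classification head per modality subset (7 tasks for 3 modalities),
        # so each training sample is routed to the head matching its modalities.
        subsets = ["image", "text", "tags",
                   "image+text", "image+tags", "text+tags",
                   "image+text+tags"]
        self.heads = nn.ModuleDict({
            s: nn.Linear(hidden * len(s.split("+")), num_emotions)
            for s in subsets
        })

    def forward(self, inputs):
        # `inputs` maps modality name -> feature tensor; missing modalities
        # are simply absent from the dict.
        present = sorted(inputs.keys(), key=["image", "text", "tags"].index)
        encoded = [self.encoders[m](inputs[m]) for m in present]
        task = "+".join(present)
        return self.heads[task](torch.cat(encoded, dim=-1))


# Example: predicting for a batch where the tag modality is missing.
model = MultitaskEmotionModel()
logits = model({"image": torch.randn(4, 2048), "text": torch.randn(4, 300)})
```

Because the modality encoders are shared across all subset-specific heads, fine-tuning them with additional unimodal or bimodal data can, under this design, also benefit the full trimodal task, which is consistent with the regularization effect described above.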