FiLMing Multimodal Sarcasm Detection with Attention

Sarcasm detection identifies natural language expressions whose intended meaning differs from their surface meaning. It has applications in many NLP tasks, such as opinion mining and sentiment analysis. Today, social media produces an abundance of multimodal data in which users express their opinions through both text and images. Our paper aims to leverage this multimodal data to improve the performance of existing sarcasm detection systems. So far, various approaches have been proposed that use the text modality, the image modality, or a fusion of both. We propose a novel architecture that uses the RoBERTa model with a co-attention layer on top to capture the context incongruity between the input text and the image attributes. Further, we integrate a feature-wise affine transformation, conditioning the input image through FiLMed ResNet blocks on textual features produced by a GRU network, to capture the multimodal information. The outputs of both components, together with the CLS token from RoBERTa, are concatenated and used for the final prediction. Our results demonstrate that the proposed model outperforms the existing state-of-the-art method by 6.14% F1 score on the public Twitter multimodal sarcasm detection dataset.
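To make the fusion concrete, below is a minimal PyTorch sketch of the pipeline the abstract describes. It is a sketch under stated assumptions, not the authors' released implementation: the module names (FiLMBlock, CoAttention, FiLMingModel), hidden sizes, block count, and the simplified max-pooled co-attention are our own illustrative choices, and RoBERTa token states, image-attribute embeddings, and the ResNet feature map are assumed to be precomputed inputs.

```python
# Illustrative sketch only; hyperparameters and module names are assumptions.
import torch
import torch.nn as nn

class FiLMBlock(nn.Module):
    """Residual conv block whose features are modulated by FiLM (gamma, beta)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn = nn.BatchNorm2d(channels, affine=False)  # FiLM supplies the affine params
        self.relu = nn.ReLU()

    def forward(self, x, gamma, beta):
        out = self.relu(self.conv1(x))
        out = self.bn(self.conv2(out))
        # Feature-wise affine transform conditioned on the text: gamma * h + beta
        out = gamma[:, :, None, None] * out + beta[:, :, None, None]
        return self.relu(out + x)  # residual connection

class CoAttention(nn.Module):
    """Simplified co-attention between text tokens and image-attribute embeddings."""
    def __init__(self, dim):
        super().__init__()
        self.affinity = nn.Linear(dim, dim, bias=False)

    def forward(self, text, attrs):
        # text: (B, T, D) RoBERTa token states; attrs: (B, A, D) attribute embeddings
        C = torch.bmm(self.affinity(text), attrs.transpose(1, 2))  # (B, T, A) affinity
        att_text = torch.softmax(C.max(dim=2).values, dim=1)       # attention over tokens
        att_attr = torch.softmax(C.max(dim=1).values, dim=1)       # attention over attributes
        t = (att_text.unsqueeze(-1) * text).sum(dim=1)             # (B, D)
        a = (att_attr.unsqueeze(-1) * attrs).sum(dim=1)            # (B, D)
        return torch.cat([t, a], dim=-1)                           # (B, 2D)

class FiLMingModel(nn.Module):
    def __init__(self, dim=768, channels=64, n_blocks=2, n_classes=2):
        super().__init__()
        self.gru = nn.GRU(dim, dim, batch_first=True)
        self.film_gen = nn.Linear(dim, 2 * channels * n_blocks)  # (gamma, beta) per block
        self.blocks = nn.ModuleList(FiLMBlock(channels) for _ in range(n_blocks))
        self.coatt = CoAttention(dim)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.classifier = nn.Linear(dim + 2 * dim + channels, n_classes)

    def forward(self, text_feats, cls_token, attr_embeds, img_feats):
        # text_feats: (B, T, D); cls_token: (B, D); attr_embeds: (B, A, D)
        # img_feats: (B, C, H, W) ResNet feature map
        _, h = self.gru(text_feats)                      # GRU summary of the text
        params = self.film_gen(h.squeeze(0))
        gammas, betas = params.chunk(2, dim=-1)
        C = img_feats.size(1)
        x = img_feats
        for i, block in enumerate(self.blocks):
            x = block(x, gammas[:, i * C:(i + 1) * C], betas[:, i * C:(i + 1) * C])
        film_out = self.pool(x).flatten(1)               # (B, C) FiLMed image summary
        coatt_out = self.coatt(text_feats, attr_embeds)  # (B, 2D) incongruity features
        fused = torch.cat([cls_token, coatt_out, film_out], dim=-1)
        return self.classifier(fused)
```

The key design point this sketch tries to reflect is that the GRU summary of the text generates the per-channel (gamma, beta) pairs, so the visual features are modulated by language before pooling, while the co-attention branch captures text-to-attribute incongruity directly; both signals are then concatenated with the CLS token for classification.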
