What do you MEME? Generating Explanations for Visual Semantic Role Labelling in Memes

Memes are a powerful means of communication on social media. Their effortless amalgamation of viral visuals and compelling messages can have far-reaching impact when amplified by deliberate promotion. Previous research on memes has primarily focused on characterizing their affective spectrum and on detecting whether a meme's message insinuates intended harm, such as hate, offense, or racism. However, memes often rely on abstraction, which can make their intended meaning elusive. Here, we introduce a novel task, EXCLAIM - generating explanations for visual semantic role labeling in memes. To this end, we curate ExHVV, a novel dataset that offers natural language explanations of connotative roles for three types of entities - heroes, villains, and victims - covering 4,680 entities present in 3K memes. We also benchmark ExHVV with several strong unimodal and multimodal baselines. Moreover, we propose LUMEN, a novel multimodal, multi-task learning framework that addresses EXCLAIM by jointly learning to predict the correct semantic roles and to generate suitable natural language explanations. LUMEN distinctly outperforms the best baseline across 18 standard natural language generation evaluation metrics. Our systematic evaluation and analyses demonstrate that the characteristic multimodal cues required for adjudicating semantic roles are also helpful for generating suitable explanations.
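The abstract describes LUMEN only at a high level. As a rough illustration of what such a joint formulation can look like - one shared encoder feeding both a role-classification head and an explanation decoder, trained with a weighted sum of the two losses - the sketch below is a minimal PyTorch rendition. All module choices, dimensions, the pre-fused multimodal input, and the mixing weight alpha are assumptions made for illustration; they are not the paper's actual LUMEN implementation.

import torch
import torch.nn as nn

NUM_ROLES = 3      # hero, villain, victim (assumed label set)
VOCAB = 32_000     # assumed generation vocabulary size
DIM = 512          # assumed shared hidden size

class JointRoleExplainer(nn.Module):
    """Hypothetical multi-task model: shared encoder, two task heads."""
    def __init__(self):
        super().__init__()
        # Stand-in for a pretrained vision-language encoder over fused features.
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=DIM, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.role_head = nn.Linear(DIM, NUM_ROLES)           # task 1: role labels
        self.decoder = nn.TransformerDecoder(                 # task 2: explanations
            nn.TransformerDecoderLayer(d_model=DIM, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.token_embed = nn.Embedding(VOCAB, DIM)
        self.lm_head = nn.Linear(DIM, VOCAB)

    def forward(self, fused_feats, expl_tokens):
        # fused_feats: (B, S, DIM) pre-fused image+text features (assumed given)
        # expl_tokens: (B, T) gold explanation token ids (teacher forcing)
        enc = self.encoder(fused_feats)
        role_logits = self.role_head(enc.mean(dim=1))         # mean-pooled -> role
        dec = self.decoder(self.token_embed(expl_tokens), enc)
        token_logits = self.lm_head(dec)                      # (B, T, VOCAB)
        return role_logits, token_logits

def joint_loss(role_logits, role_gold, token_logits, token_gold, alpha=0.5):
    # Weighted sum of the two task losses; alpha is a hypothetical mixing weight.
    cls = nn.functional.cross_entropy(role_logits, role_gold)
    gen = nn.functional.cross_entropy(
        token_logits.reshape(-1, VOCAB), token_gold.reshape(-1)
    )
    return alpha * cls + (1 - alpha) * gen

# Toy forward pass with random inputs (shapes assumed; target shift omitted).
feats = torch.randn(2, 16, DIM)
tokens = torch.randint(0, VOCAB, (2, 20))
roles = torch.randint(0, NUM_ROLES, (2,))
model = JointRoleExplainer()
role_logits, token_logits = model(feats, tokens)
loss = joint_loss(role_logits, roles, token_logits, tokens)

The usual payoff of such a design, and plausibly what the paper's joint training exploits, is that role prediction and explanation decoding share one encoder pass, so the cues learned for labeling a hero, villain, or victim also condition the generated explanation.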
