The Hateful Memes Challenge: Detecting Hate Speech in Multimodal Memes

This work proposes a new challenge set for multimodal classification, focusing on detecting hate speech in multimodal memes. It is constructed such that unimodal models struggle and only multimodal models can succeed: difficult examples ("benign confounders") are added to the dataset to make it hard to rely on unimodal signals. The task requires subtle reasoning, yet is straightforward to evaluate as a binary classification problem. We provide baseline performance numbers for unimodal models, as well as for multimodal models with various degrees of sophistication. We find that state-of-the-art methods perform poorly compared to humans (64.73% vs. 84.7% accuracy), illustrating the difficulty of the task and highlighting the challenge that this important problem poses to the community.
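Since the task is evaluated as a straightforward binary classification problem, the headline numbers reduce to simple metrics. As an illustration only (not the authors' evaluation code), here is a minimal, dependency-free sketch of binary accuracy and ROC AUC, the latter computed via the rank-sum (Mann–Whitney U) formulation commonly used for such tasks:

```python
def accuracy(labels, preds):
    """Fraction of correct binary predictions (labels and preds are 0/1)."""
    return sum(int(l == p) for l, p in zip(labels, preds)) / len(labels)

def auroc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) formula.

    labels: 0/1 ground truth; scores: higher means "more likely hateful".
    Ties in scores share their average rank.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # Find the run of tied scores starting at position i.
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # average 1-based rank of the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, l in zip(ranks, labels) if l == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

For example, a classifier that ranks every hateful meme above every benign one scores an AUC of 1.0, while a random scorer hovers around 0.5; the 64.73% vs. 84.7% accuracy gap reported above would be computed with the `accuracy` function here.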
