Hatemoji: A Test Suite and Adversarially-Generated Dataset for Benchmarking and Detecting Emoji-based Hate

Detecting online hate is a complex task, and low-performing models have harmful consequences when used for sensitive applications such as content moderation. Emoji-based hate is a key emerging challenge for automated detection. We present HatemojiCheck, a test suite of 3,930 short-form statements that allows us to evaluate performance on hateful language expressed with emoji. Using the test suite, we expose weaknesses in existing hate detection models. To address these weaknesses, we create the HatemojiTrain dataset using a human-and-model-in-the-loop approach. Models trained on these 5,912 adversarial examples perform substantially better at detecting emoji-based hate, while retaining strong performance on text-only hate. Both HatemojiCheck and HatemojiTrain are made publicly available.
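To make the evaluation setting concrete, the following is a minimal sketch of how one might probe an off-the-shelf hate speech classifier on paired text-only and emoji-substituted statements, in the spirit of HatemojiCheck. The checkpoint name is a placeholder and the test cases are illustrative; this is not the authors' released evaluation code or test suite.

```python
from transformers import pipeline

# Placeholder checkpoint: substitute any binary hate speech classifier.
clf = pipeline("text-classification", model="your-org/hate-speech-classifier")

# Paired test cases: a text-only statement and the same statement with an
# emoji substituted for a key word, probing emoji-based robustness.
cases = [
    "I want to hurt all members of that group",        # text-only
    "I want to \U0001F52A all members of that group",  # emoji substitution
]

# A robust model should assign similar labels to both variants; a large
# drop in confidence on the emoji case signals the weakness HatemojiCheck
# is designed to expose.
for text in cases:
    pred = clf(text)[0]
    print(f"{text!r} -> {pred['label']} ({pred['score']:.2f})")
```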
