ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection

Toxic language detection systems often falsely flag text that contains minority group mentions as toxic, as those groups are often the targets of online hate. Such over-reliance on spurious correlations also causes systems to struggle with detecting implicitly toxic language. To help mitigate these issues, we create ToxiGen, a new large-scale, machine-generated dataset of 274k toxic and benign statements about 13 minority groups. We develop a demonstration-based prompting framework and an adversarial classifier-in-the-loop decoding method to generate subtly toxic and benign text with a massive pretrained language model. Controlling machine generation in this way allows ToxiGen to cover implicitly toxic text at a larger scale, and about more demographic groups, than previous resources of human-written text. We conduct a human evaluation on a challenging subset of ToxiGen and find that annotators struggle to distinguish machine-generated text from human-written language. We also find that 94.5% of toxic examples are labeled as hate speech by human annotators. Using three publicly available datasets, we show that finetuning a toxicity classifier on our data substantially improves its performance on human-written data. We also demonstrate that ToxiGen can be used to combat machine-generated toxicity, as finetuning significantly improves the classifier on our evaluation subset.
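The classifier-in-the-loop decoding idea can be illustrated with a minimal sketch: at each step, the language model's next-token scores are adjusted by a toxicity classifier's score on each candidate continuation, so the classifier steers generation toward (or away from) subtly toxic text. The `lm_logits` and `toxicity_score` functions below are toy stand-ins, not the actual models used in ToxiGen; this is an assumed simplification for illustration only.

```python
# Toy stand-ins for a pretrained LM and a toxicity classifier; the real
# pipeline uses a massive pretrained LM and a trained toxicity detector.
def lm_logits(prefix, vocab):
    # Hypothetical LM: mildly prefers tokens already seen in the prefix.
    return {tok: (1.0 if tok in prefix else 0.0) for tok in vocab}

def toxicity_score(text_tokens):
    # Hypothetical classifier: flags a placeholder "bad" token as toxic.
    return 1.0 if "bad" in text_tokens else 0.0

def guided_step(prefix, vocab, weight=5.0):
    """One decoding step: mix LM logits with a classifier term so that
    continuations scored as toxic are demoted (positive weight) or,
    to adversarially elicit toxic text, promoted (negative weight)."""
    logits = lm_logits(prefix, vocab)
    adjusted = {
        tok: logit - weight * toxicity_score(prefix + [tok])
        for tok, logit in logits.items()
    }
    # Greedy pick over the classifier-adjusted scores.
    return max(adjusted, key=adjusted.get)

vocab = ["good", "bad", "day"]
safe_token = guided_step(["good"], vocab)            # classifier steers away from "bad"
adversarial_token = guided_step(["good"], vocab, weight=-5.0)  # flipped sign promotes it
```

Flipping the sign of the classifier weight is what turns a detoxifying decoder into an adversarial one that surfaces hard, implicitly toxic examples for the classifier to learn from.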
