Challenges in Automated Debiasing for Toxic Language Detection

Biased associations have been a challenge in the development of classifiers for detecting toxic language, hindering both fairness and accuracy. As potential solutions, we investigate recently introduced debiasing methods for text classification datasets and models, as applied to toxic language detection. Our focus is on lexical markers (e.g., swear words, slurs, identity mentions) and dialectal markers (specifically, African American English). Our comprehensive experiments establish that existing methods are limited in their ability to prevent biased behavior in current toxicity detectors. We then propose an automatic, dialect-aware data correction method as a proof of concept. Despite its use of synthetic labels, this method reduces dialectal associations with toxicity. Overall, our findings show that debiasing a model trained on biased toxic language data is less effective than simply relabeling the data to remove existing biases.
