Detecting Cross-Geographic Biases in Toxicity Modeling on Social Media

Online social media platforms increasingly rely on Natural Language Processing (NLP) techniques to detect abusive content at scale in order to mitigate the harms it causes to their users. However, these techniques suffer from various sampling and association biases present in training data, often resulting in subpar performance on content relevant to marginalized groups, potentially exacerbating disproportionate harms toward them. Studies of such biases have so far focused on only a handful of disparity axes and subgroups for which annotations or lexicons are available. Consequently, biases concerning non-Western contexts remain largely ignored in the literature. In this paper, we introduce a weakly supervised method to robustly detect lexical biases in broader geo-cultural contexts. Through a case study on a publicly available toxicity detection model, we show that our method identifies salient groups of cross-geographic errors, and, in a follow-up study, that these groupings reflect human judgments of offensive and inoffensive language in those geographic contexts. We also analyze a model trained on a dataset with ground-truth labels to better understand these biases, and present preliminary mitigation experiments.
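
The abstract does not spell out the method's internals, but a rough sketch of the kind of geography-aware lexical probe it implies might look like the snippet below. This is a hypothetical illustration, not the authors' weakly supervised method: it takes comments from two geographies together with the scores a toxicity model assigns them, and surfaces tokens whose average score in one geography is most inflated relative to the other. The function names, the toy comments, and the scores are all illustrative assumptions.

```python
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercased word tokens; a crude stand-in for a real tokenizer."""
    return re.findall(r"[a-z']+", text.lower())

def token_mean_scores(comments, scores):
    """Mean model score over the comments in which each token appears."""
    sums, counts = defaultdict(float), Counter()
    for text, score in zip(comments, scores):
        for tok in set(tokenize(text)):
            sums[tok] += score
            counts[tok] += 1
    return {tok: sums[tok] / counts[tok] for tok in counts}

def salient_tokens(group, reference, top_k=5):
    """
    Rank tokens by how much their mean score in `group` exceeds their mean
    score in `reference`. Tokens unseen in the reference fall back to the
    reference corpus-level mean, so region-specific terms are still ranked.
    `group` and `reference` are (comments, scores) pairs.
    """
    grp = token_mean_scores(*group)
    ref = token_mean_scores(*reference)
    ref_corpus_mean = sum(reference[1]) / len(reference[1])
    gaps = {tok: grp[tok] - ref.get(tok, ref_corpus_mean) for tok in grp}
    return sorted(gaps.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

if __name__ == "__main__":
    # Toy comments with hypothetical toxicity scores; in practice the scores
    # would come from whatever publicly available model is being audited.
    india = (["yaar that was a blast", "that was a blast"], [0.80, 0.10])
    us = (["that was a blast", "what a blast"], [0.10, 0.12])
    # "yaar" (an innocuous term of address) surfaces as the most elevated token.
    print(salient_tokens(india, us))
```

On the toy data, the Indian-English term of address "yaar" tops the ranking, mirroring the kind of cross-geographic false positive the abstract describes. A real audit would use far larger geography-tagged corpora and more careful statistics (for example, significance-tested weighted log-odds) than this simplified sketch.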
