Detoxifying Language Models Risks Marginalizing Minority Voices

Language models (LMs) must be both safe and equitable to be responsibly deployed in practice. With safety in mind, numerous detoxification techniques (e.g., Dathathri et al. 2020; Krause et al. 2020) have been proposed to mitigate toxic LM generations. In this work, we show that these detoxification techniques hurt equity: they decrease the utility of LMs on language used by marginalized groups (e.g., African-American English and minority identity mentions). In particular, we perform automatic and human evaluations of text generation quality when LMs are conditioned on inputs with different dialects and group identifiers. We find that detoxification makes LMs more brittle to distribution shift, especially on language used by marginalized groups. We identify that these failures stem from detoxification methods exploiting spurious correlations in toxicity datasets. Overall, our results highlight the tension between the controllability and distributional robustness of LMs.
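To make the evaluation recipe concrete, below is a minimal sketch (not the authors' code) of the automatic side of such an evaluation: compare a language model's perplexity on dialect-marked text before and after detoxification. The detoxified checkpoint path, the example sentences, and the use of per-sentence perplexity are illustrative assumptions, not the paper's actual models or data.

```python
# Sketch: does detoxification raise perplexity more on African-American
# English (AAE) than on White-Aligned English (WAE)? Placeholder data.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

def perplexity(model, tokenizer, text):
    """Token-level perplexity of `text` under `model`."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing input_ids as labels makes the model return the LM loss.
        loss = model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
base_lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()
# Hypothetical checkpoint: GPT-2 after some detoxification method
# (e.g., fine-tuning on non-toxic text); substitute your own.
detox_lm = GPT2LMHeadModel.from_pretrained("path/to/detoxified-gpt2").eval()

# Placeholder sentences standing in for curated WAE/AAE evaluation sets.
texts = {
    "WAE": "That party last night was really fun.",
    "AAE": "That party last night was lit, no cap.",
}
for dialect, text in texts.items():
    print(dialect,
          f"base ppl={perplexity(base_lm, tokenizer, text):.1f}",
          f"detox ppl={perplexity(detox_lm, tokenizer, text):.1f}")
# If detoxification disproportionately hurts minority dialects, the
# perplexity increase (detox - base) should be larger for AAE than WAE.
```

On real data one would average over full dialect corpora rather than single sentences; the human evaluation of generation quality described in the abstract has no such simple automatic proxy.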

[1] Yejin Choi et al. Challenges in Automated Debiasing for Toxic Language Detection. EACL, 2021.

[2] Richard Socher et al. GeDi: Generative Discriminator Guided Sequence Generation. Findings of EMNLP, 2021.

[3] Peter Henderson et al. Ethical Challenges in Data-Driven Dialogue Systems. AIES, 2018.

[4] Lucy Vasserman et al. Measuring and Mitigating Unintended Bias in Text Classification. AIES, 2018.

[5] Michael McCloskey et al. Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem. Psychology of Learning and Motivation, 1989.

[6] Yann Dauphin et al. Hierarchical Neural Story Generation. ACL, 2018.

[7] Sameer Singh et al. Universal Adversarial Triggers for Attacking and Analyzing NLP. EMNLP, 2019.

[8] J. Rosa et al. Undoing Appropriateness: Raciolinguistic Ideologies and Language Diversity in Education. Harvard Educational Review, 2015.

[9] Chris J. Kennedy et al. Constructing Interval Variables via Faceted Rasch Measurement and Multitask Deep Learning: A Hate Speech Application. arXiv, 2020.

[10] Doug Downey et al. Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks. ACL, 2020.

[11] J. Weston et al. Recipes for Safety in Open-domain Chatbots. arXiv, 2020.

[12] Yejin Choi et al. RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models. Findings of EMNLP, 2020.

[13] Ilya Sutskever et al. Language Models are Unsupervised Multitask Learners. OpenAI technical report, 2019.

[14] William Yang Wang et al. “Dats Wassup!!”: Investigating African-American Vernacular English in Transformer-Based Text Generation. EMNLP, 2020.

[15] Nanyun Peng et al. Towards Controllable Biases in Language Generation. Findings of EMNLP, 2020.

[16] Lisa J. Green. African American English: A Linguistic Introduction. Cambridge University Press, 2002.

[17] Saif Mohammad et al. Examining Gender and Race Bias in Two Hundred Sentiment Analysis Systems. *SEM, 2018.

[18] Solon Barocas et al. Language (Technology) is Power: A Critical Survey of “Bias” in NLP. ACL, 2020.

[19] Jason Yosinski et al. Plug and Play Language Models: A Simple Approach to Controlled Text Generation. ICLR, 2020.

[20] Percy Liang et al. Distributionally Robust Language Modeling. EMNLP, 2019.

[21] Thiago Dias Oliva et al. Fighting Hate Speech, Silencing Drag Queens? Artificial Intelligence in Content Moderation and Risks to LGBTQ Voices Online. Sexuality & Culture, 2020.

[22] Luke Zettlemoyer et al. Don’t Take the Easy Way Out: Ensemble Based Methods for Avoiding Known Dataset Biases. EMNLP, 2019.

[23] Sameer Singh et al. Eliciting Knowledge from Language Models Using Automatically Generated Prompts. EMNLP, 2020.

[24] Lav R. Varshney et al. CTRL: A Conditional Transformer Language Model for Controllable Generation. arXiv, 2019.

[25] Yejin Choi et al. The Risk of Racial Bias in Hate Speech Detection. ACL, 2019.

[26] Yejin Choi et al. Social Bias Frames: Reasoning about Social and Power Implications of Language. ACL, 2020.