Fairness and Robustness in Invariant Learning: A Case Study in Toxicity Classification

Robustness is of central importance in machine learning and has given rise to the fields of domain generalization and invariant learning, which are concerned with improving performance on a test distribution distinct from but related to the training distribution. In light of recent work suggesting an intimate connection between fairness and robustness, we investigate whether algorithms from robust ML can be used to improve the fairness of classifiers that are trained on biased data and tested on unbiased data. We apply Invariant Risk Minimization (IRM), a domain generalization algorithm that uses a causal-discovery-inspired method to find robust predictors, to the task of fairly predicting the toxicity of internet comments. We show that IRM achieves better out-of-distribution accuracy and fairness than Empirical Risk Minimization (ERM) methods, and analyze both the difficulties that arise when applying IRM in practice and the conditions under which IRM is likely to be effective in this scenario. We hope that this work will inspire further studies of how robust machine learning methods relate to algorithmic fairness.
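To make the IRM-versus-ERM comparison concrete, the sketch below shows the IRMv1 training objective from Arjovsky et al.'s Invariant Risk Minimization in the form commonly used in practice: each environment's risk is augmented with the squared gradient norm of that risk taken with respect to a fixed scalar "dummy" classifier. This is a minimal PyTorch illustration under stated assumptions, not the authors' code; the helper names (`irm_penalty`, `irm_loss`), the 300-dimensional sentence-embedding input, the linear probe, and the penalty weight are all illustrative.

```python
import torch
import torch.nn.functional as F

def irm_penalty(logits, labels):
    """IRMv1 invariance penalty for one environment: the squared norm of
    the gradient of the environment risk with respect to a fixed scalar
    "dummy" classifier w = 1.0 multiplying the logits."""
    scale = torch.tensor(1.0, requires_grad=True)
    loss = F.binary_cross_entropy_with_logits(logits * scale, labels)
    # create_graph=True lets gradients flow through the penalty itself.
    grad = torch.autograd.grad(loss, [scale], create_graph=True)[0]
    return (grad ** 2).sum()

def irm_loss(model, envs, lam):
    """Average per-environment risk plus lam times the average penalty."""
    risks, penalties = [], []
    for x, y in envs:  # one (features, labels) batch per environment
        logits = model(x).squeeze(-1)
        risks.append(F.binary_cross_entropy_with_logits(logits, y))
        penalties.append(irm_penalty(logits, y))
    return torch.stack(risks).mean() + lam * torch.stack(penalties).mean()

# Hypothetical usage: a linear probe over 300-d sentence embeddings,
# with two synthetic training environments of 32 examples each.
model = torch.nn.Linear(300, 1)
envs = [(torch.randn(32, 300), torch.randint(0, 2, (32,)).float())
        for _ in range(2)]
loss = irm_loss(model, envs, lam=100.0)
loss.backward()
```

Setting `lam` to zero recovers plain ERM on the pooled environments, which is what makes this objective a natural drop-in point for comparing the two methods on the same model and data.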
