Can We Improve Model Robustness through Secondary Attribute Counterfactuals?

Developing robust NLP models that perform well on many slices of data, even small ones, is a difficult but important challenge, with implications ranging from fairness to general reliability. To this end, recent research has explored how models rely on spurious correlations and how counterfactual data augmentation (CDA) can mitigate such issues. In this paper we study how and why modeling counterfactuals over multiple attributes can go significantly further in improving model performance. We propose RDI, a context-aware methodology that accounts for the impact of secondary attributes on the model's predictions and increases sensitivity to secondary attributes via reweighted counterfactually augmented data. Applying RDI to toxicity detection, we find that accounting for secondary attributes can significantly improve robustness, improving sliced accuracy on the original dataset by up to 7% over existing robustness methods. We also demonstrate that RDI generalizes to the coreference resolution task and provide guidelines for extending it to other tasks.
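The abstract does not spell out the RDI procedure, but the CDA step it builds on is standard: each training example is paired with counterfactual copies in which attribute terms are swapped while the label is kept fixed, and the augmented copies can then be reweighted during training. The sketch below is a minimal, hypothetical illustration of that substitution step; the term pairs, function names, and word-level swapping are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of the basic CDA step: generate counterfactual
# copies of an example by swapping paired attribute terms, keeping the
# original label. The term pairs below are illustrative only.
GENDER_PAIRS = [("he", "she"), ("him", "her"), ("man", "woman")]

def counterfactual(text, pairs):
    """Return the text with each paired attribute term swapped for its
    counterpart, or None if no term was changed."""
    swap = {}
    for a, b in pairs:
        swap[a], swap[b] = b, a
    flipped = " ".join(swap.get(tok, tok) for tok in text.split())
    return flipped if flipped != text else None

def augment(dataset, pairs):
    """Return the original (text, label) pairs plus counterfactual
    copies with the label preserved."""
    out = []
    for text, label in dataset:
        out.append((text, label))
        cf = counterfactual(text, pairs)
        if cf is not None:
            out.append((cf, label))
    return out
```

A real pipeline would use lexicon- or morphology-aware substitution rather than whitespace tokenization, and would assign per-example weights to the augmented copies, as the reweighting described in the abstract suggests.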
