Perturbation Sensitivity Analysis to Detect Unintended Model Biases

Data-driven statistical Natural Language Processing (NLP) techniques leverage large amounts of language data to build models that can understand language. However, most language data reflect the public discourse at the time the data was produced, and hence NLP models are susceptible to learning incidental associations around named referents at a particular point in time, in addition to general linguistic meaning. An NLP system designed to model notions such as sentiment or toxicity should ideally produce scores that are independent of the identity of the named entities mentioned in the text and of their social associations. For example, a general-purpose sentiment analysis system should interpret the phrase "I hate Katy Perry" as having the same sentiment as "I hate Taylor Swift". Based on this idea, we propose a generic evaluation framework, Perturbation Sensitivity Analysis, which detects unintended model biases related to named entities and requires no new annotations or corpora. We demonstrate the utility of this analysis by applying it to two different NLP models, a sentiment model and a toxicity model, run on English-language online comments from four different genres.
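As a minimal sketch of the core idea (not the authors' reference implementation), the snippet below perturbs template sentences by substituting different names and measures how much a model's score varies with the name alone. The `score_fn` callable, the `[NAME]` placeholder, and the toy templates and names are illustrative assumptions; any sentiment or toxicity scorer could be plugged in.

```python
# Sketch of perturbation sensitivity analysis: swap named entities into
# template sentences and summarize how much the model score shifts with
# the name alone. All names, templates, and the scorer are hypothetical.
from statistics import mean, pstdev
from typing import Callable, Sequence


def perturbation_sensitivity(
    templates: Sequence[str],
    names: Sequence[str],
    score_fn: Callable[[str], float],
    placeholder: str = "[NAME]",
) -> dict:
    """Substitute each name into each template, score every perturbed
    sentence, and report how much the score varies across name swaps."""
    per_sentence_dev = []
    per_sentence_range = []
    for template in templates:
        scores = [score_fn(template.replace(placeholder, name)) for name in names]
        per_sentence_dev.append(pstdev(scores))            # spread across name swaps
        per_sentence_range.append(max(scores) - min(scores))
    return {
        "score_deviation": mean(per_sentence_dev),         # average per-sentence std dev
        "score_range": mean(per_sentence_range),           # average max-min gap
    }


if __name__ == "__main__":
    # Toy scorer that (undesirably) reacts to which name is mentioned.
    def toy_score(text: str) -> float:
        return 0.9 if "Katy Perry" in text else 0.2

    templates = ["I hate [NAME].", "[NAME] released a new album."]
    names = ["Katy Perry", "Taylor Swift"]
    print(perturbation_sensitivity(templates, names, toy_score))
```

With an identity-blind scorer, both summary values would be near zero; large values indicate that the score depends on which named entity appears rather than on the surrounding text.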
