BBQ: A Hand-Built Bias Benchmark for Question Answering

It is well documented that NLP models learn social biases present in the world, but little work has been done to show how these biases manifest in actual model outputs for applied tasks like question answering (QA). We introduce the Bias Benchmark for QA (BBQ), a dataset consisting of question-sets constructed by the authors that highlight attested social biases against people belonging to protected classes along nine different social dimensions relevant for U.S. English-speaking contexts. Our task evaluates model responses at two distinct levels: (i) given an under-informative context, test how strongly model answers reflect social biases, and (ii) given an adequately informative context, test whether the model’s biases still override a correct answer choice. We find that models strongly rely on stereotypes when the context is ambiguous, meaning that the model’s outputs consistently reproduce harmful biases in this setting. Though models are much more accurate when the context provides an unambiguous answer, they still rely on stereotyped information and achieve an accuracy 2.5 percentage points higher on examples where the correct answer aligns with a social bias, with this accuracy difference widening to 5 points for examples targeting gender.

[1]  Douwe Kiela,et al.  Learning from the Worst: Dynamically Generated Datasets to Improve Online Hate Detection , 2021, Annual Meeting of the Association for Computational Linguistics.

[2]  Yang Trista Cao,et al.  Toward Gender-Inclusive Coreference Resolution , 2019, ACL.

[3]  Chandler May,et al.  On Measuring Social Biases in Sentence Encoders , 2019, NAACL.

[4]  Jason Baldridge,et al.  Mind the GAP: A Balanced Corpus of Gendered Ambiguous Pronouns , 2018, TACL.

[5]  Ingmar Weber,et al.  Racial Bias in Hate Speech and Abusive Language Detection Datasets , 2019, Proceedings of the Third Workshop on Abusive Language Online.

[6]  Nanyun Peng,et al.  The Woman Worked as a Babysitter: On Biases in Language Generation , 2019, EMNLP.

[7]  Anupam Datta,et al.  Gender Bias in Neural Natural Language Processing , 2018, Logic, Language, and Security.

[8]  C. Koedel,et al.  Race and gender effects on employer interest in job applicants: new evidence from a resume field experiment , 2016 .

[9]  Konstantinos Tzioumis,et al.  Demographic aspects of first names , 2018, Scientific Data.

[10]  Koustuv Sinha,et al.  Evaluating Gender Bias in Natural Language Inference , 2021, ArXiv.

[11]  J. Kasof,et al.  Sex bias in the naming of stimulus persons. , 1993, Psychological bulletin.

[12]  R. Mar,et al.  Zahra or Zoe, Arjun or Andrew? Bicultural baby names reflect identity and pragmatic concerns. , 2020, Cultural diversity & ethnic minority psychology.

[13]  Keiko Nakao,et al.  Updating Occupational Prestige and Socioeconomic Scores: How the New Measures Measure up , 1994 .

[14]  Alexandra Chouldechova,et al.  What’s in a Name? Reducing Bias in Bios without Access to Protected Attributes , 2019, NAACL.

[15]  Daniel Khashabi,et al.  UNQOVERing Stereotypical Biases via Underspecified Questions , 2020, FINDINGS.

[16]  Dong Nguyen,et al.  HateCheck: Functional Tests for Hate Speech Detection Models , 2021, ACL/IJCNLP.

[17]  Arvind Narayanan,et al.  Semantics derived automatically from language corpora contain human-like biases , 2016, Science.

[18]  Yejin Choi,et al.  Social Bias Frames: Reasoning about Social and Power Implications of Language , 2020, ACL.

[19]  Nanyun Peng,et al.  What do Bias Measures Measure? , 2021, ArXiv.

[20]  Shikha Bordia,et al.  Identifying and Reducing Gender Bias in Word-Level Language Models , 2019, NAACL.

[21]  Solon Barocas,et al.  Language (Technology) is Power: A Critical Survey of “Bias” in NLP , 2020, ACL.

[22]  Rachel Rudinger,et al.  Gender Bias in Coreference Resolution , 2018, NAACL.

[23]  Vivek Srikumar,et al.  On Measuring and Mitigating Biased Inferences of Word Embeddings , 2019, AAAI.

[24]  Liam Magee,et al.  Intersectional Bias in Causal Language Models , 2021, ArXiv.

[25]  Hannaneh Hajishirzi,et al.  UnifiedQA: Crossing Format Boundaries With a Single QA System , 2020, FINDINGS.

[26]  Bruce J Hillman Gender Bias. , 2018, Journal of the American College of Radiology : JACR.

[27]  Robert Munro,et al.  Detecting Independent Pronoun Bias with Partially-Synthetic Data Generation , 2020, EMNLP.

[28]  Ellen D. Wu “They Call Me Bruce, But They Won't Call Me Bruce Jones:” Asian American Naming Preferences and Patterns , 1999 .

[29]  Hanna M. Wallach,et al.  Stereotyping Norwegian Salmon: An Inventory of Pitfalls in Fairness Benchmark Datasets , 2021, ACL.