Stereotyping Norwegian Salmon: An Inventory of Pitfalls in Fairness Benchmark Datasets

Auditing NLP systems for computational harms like surfacing stereotypes is an elusive goal. Several recent efforts have focused on benchmark datasets consisting of pairs of contrastive sentences, which are often accompanied by metrics that aggregate an NLP system’s behavior on these pairs into measurements of harms. We examine four such benchmarks constructed for two NLP tasks: language modeling and coreference resolution. We apply a measurement modeling lens—originating from the social sciences—to inventory a range of pitfalls that threaten these benchmarks’ validity as measurement models for stereotyping. We find that these benchmarks frequently lack clear articulations of what is being measured, and we highlight a range of ambiguities and unstated assumptions that affect how these benchmarks conceptualize and operationalize stereotyping.
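
The pair-and-aggregate design described above can be made concrete with a small sketch. The Python below is illustrative only and is not the metric of any of the four benchmarks examined: `ContrastivePair`, `sentence_score`, and the preference-rate aggregation are assumptions standing in for whatever sentence-scoring function and aggregation a particular benchmark defines.

```python
# Minimal sketch (assumed, not from the paper) of a pair-based bias measurement:
# each benchmark item is a pair of contrastive sentences, and the model-level
# measurement is the fraction of pairs for which the model scores the
# stereotypical sentence higher than its counterpart.
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class ContrastivePair:
    stereotypical: str       # sentence reflecting a stereotype
    anti_stereotypical: str  # the contrastive counterpart (e.g., group term swapped)


def stereotype_preference_rate(
    pairs: Iterable[ContrastivePair],
    sentence_score: Callable[[str], float],  # higher = more probable under the model
) -> float:
    """Fraction of pairs where the model prefers the stereotypical sentence.

    A value near 0.5 is often read as "no preference"; whether such an
    aggregate is a valid measurement of stereotyping is exactly what the
    measurement modeling lens interrogates.
    """
    pairs = list(pairs)
    preferred = sum(
        sentence_score(p.stereotypical) > sentence_score(p.anti_stereotypical)
        for p in pairs
    )
    return preferred / len(pairs)
```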
