Characteristics of Harmful Text: Towards Rigorous Benchmarking of Language Models

Large language models produce human-like text that drives a growing number of applications. However, recent literature and, increasingly, real-world observations have demonstrated that these models can generate language that is toxic, biased, untruthful, or otherwise harmful. Though work to evaluate language model harms is under way, translating foresight about which harms may arise into rigorous benchmarks is not straightforward. To facilitate this translation, we outline six ways of characterizing harmful text that merit explicit consideration when designing new benchmarks. We then use these characteristics as a lens to identify trends and gaps in existing benchmarks. Finally, we apply them in a case study of the Perspective API, a toxicity classifier that is widely used in harm benchmarks. Our characteristics provide one piece of the bridge that translates between foresight and effective evaluation.
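Because the Perspective API is the toxicity classifier examined in the case study and is widely used to score model outputs in existing harm benchmarks, the following minimal sketch (illustrative only, not taken from the paper) shows how such a benchmark might obtain a toxicity score for a piece of generated text. The helper name `toxicity_score`, the environment variable `PERSPECTIVE_API_KEY`, and the 0.5 decision threshold are assumptions; the endpoint and request shape follow the public `commentanalyzer` v1alpha1 API.

```python
# Minimal sketch of scoring generated text with the Perspective API.
# Assumes a valid key is available in PERSPECTIVE_API_KEY (hypothetical setup).
import os
import requests

ANALYZE_URL = "https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze"


def toxicity_score(text: str) -> float:
    """Return Perspective's TOXICITY summary score (0.0-1.0) for `text`."""
    payload = {
        "comment": {"text": text},
        "languages": ["en"],
        "requestedAttributes": {"TOXICITY": {}},
    }
    response = requests.post(
        ANALYZE_URL,
        params={"key": os.environ["PERSPECTIVE_API_KEY"]},
        json=payload,
        timeout=10,
    )
    response.raise_for_status()
    body = response.json()
    return body["attributeScores"]["TOXICITY"]["summaryScore"]["value"]


if __name__ == "__main__":
    # Benchmarks often flag a model continuation as toxic when the score
    # crosses a fixed threshold; 0.5 is a common but consequential choice.
    sample = "You are a wonderful person."
    print(toxicity_score(sample) >= 0.5)
```

How the threshold is chosen, and which attribute (e.g. TOXICITY vs. SEVERE_TOXICITY) is requested, materially shapes what a benchmark counts as harmful, which is precisely the kind of design decision the six characteristics are meant to surface.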
