Hurtful words: quantifying biases in clinical contextual word embeddings

In this work, we examine the extent to which embeddings may encode marginalized populations differently, and how this may perpetuate biases and worsen performance on clinical tasks. We pretrain deep embedding models (BERT) on medical notes from the MIMIC-III hospital dataset and quantify potential disparities using two approaches. First, we identify dangerous latent relationships captured by the contextual word embeddings using a fill-in-the-blank method on text from real clinical notes, quantified with a log probability bias score. Second, we evaluate performance gaps across different definitions of fairness on over 50 downstream clinical prediction tasks, including detection of acute and chronic conditions. We find that classifiers trained on BERT representations exhibit statistically significant differences in performance, often favoring the majority group with respect to gender, language, ethnicity, and insurance status. Finally, we explore shortcomings of using adversarial debiasing to obfuscate subgroup information in contextual word embeddings, and recommend best practices for such deep embedding models in clinical settings.
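As a rough illustration of the fill-in-the-blank probe, the sketch below computes a log probability bias score with a masked language model: the probability of a target word (e.g., a gendered term) is measured once with the attribute word present and once with the attribute also masked, and the log ratio indicates how strongly the attribute shifts the model toward that target. This is a minimal sketch assuming the Hugging Face `transformers` API; the generic `bert-base-uncased` checkpoint, the hand-written template, and the target/attribute words are illustrative assumptions, not the clinically pretrained models or note-derived templates used in the paper.

```python
# Minimal sketch of a log probability bias score probe for a masked LM.
# Assumptions: Hugging Face `transformers`, a generic BERT checkpoint, and an
# illustrative template -- not the paper's exact setup.
import math

import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()


def masked_token_prob(sentence: str, target: str) -> float:
    """Probability of `target` at the first [MASK] position in `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    mask_idx = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0].item()
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits[0, mask_idx], dim=-1)
    return probs[tokenizer.convert_tokens_to_ids(target)].item()


def log_probability_bias_score(template: str, target: str, attribute: str) -> float:
    """log(p_target / p_prior): how much the attribute shifts the model's belief
    in the target word, relative to a prior where the attribute is also masked."""
    # p_target: target slot masked, attribute filled in.
    filled = template.replace("[TARGET]", tokenizer.mask_token).replace("[ATTRIBUTE]", attribute)
    # p_prior: both slots masked; the first [MASK] is the target slot.
    prior = template.replace("[TARGET]", tokenizer.mask_token).replace("[ATTRIBUTE]", tokenizer.mask_token)
    return math.log(masked_token_prob(filled, target) / masked_token_prob(prior, target))


# Illustrative comparison: a positive gap means the attribute raises the model's
# belief in "male" more than in "female" for this template.
template = "the [TARGET] patient is [ATTRIBUTE]"
gap = (log_probability_bias_score(template, "male", "agitated")
       - log_probability_bias_score(template, "female", "agitated"))
print(f"male - female log probability bias score gap: {gap:+.3f}")
```

In the setting described above, such a probe would be run over many sentences drawn from real clinical notes and aggregated per demographic pair, rather than over a single hand-written template.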
