CheXclusion: Fairness gaps in deep chest X-ray classifiers

Machine learning systems have received much attention recently for their ability to achieve expert-level performance on clinical tasks, particularly in medical imaging. Here, we examine the extent to which state-of-the-art deep learning classifiers trained to yield diagnostic labels from X-ray images are biased with respect to protected attributes. We train convolutional neural networks to predict 14 diagnostic labels in three prominent public chest X-ray datasets (MIMIC-CXR, ChestX-ray8, and CheXpert), as well as in a multi-site aggregation of all three. We evaluate the TPR disparity, defined as the difference in true positive rates (TPR) between subgroups, for protected attributes such as patient sex, age, race, and insurance type (used as a proxy for socioeconomic status). We demonstrate that TPR disparities exist in the state-of-the-art classifiers in all datasets, for all clinical tasks, and for all subgroups. The classifier trained on the multi-source dataset shows the smallest disparities, suggesting that pooling data from multiple sources is one way to reduce bias. We find that TPR disparities are not significantly correlated with a subgroup's proportional disease burden. As clinical models move from papers to products, we encourage clinical decision makers to carefully audit for algorithmic disparities prior to deployment. Our code can be found at this https URL.
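To make the TPR disparity concrete, the following is a minimal sketch of how such a gap could be computed for one binary diagnostic label. It is not the released code. The function names, the toy data, and the choice of the median subgroup TPR as the reference point are all assumptions made for illustration; the abstract specifies only that the disparity is a difference in true positive rates.

```python
from collections import defaultdict
from statistics import median


def true_positive_rates(y_true, y_pred, groups):
    """Return {group: TPR} for one binary label.

    Groups with no positive ground-truth cases are omitted,
    because their TPR is undefined.
    """
    positives = defaultdict(int)       # ground-truth positives per group
    true_positives = defaultdict(int)  # correctly predicted positives per group
    for truth, pred, group in zip(y_true, y_pred, groups):
        if truth == 1:
            positives[group] += 1
            if pred == 1:
                true_positives[group] += 1
    return {g: true_positives[g] / positives[g] for g in positives}


def tpr_disparities(y_true, y_pred, groups):
    """Gap between each subgroup's TPR and the median subgroup TPR.

    Assumption: the median is used as the reference; a pairwise
    difference between two subgroups would be an equally valid choice.
    """
    tpr = true_positive_rates(y_true, y_pred, groups)
    reference = median(tpr.values())
    return {g: rate - reference for g, rate in tpr.items()}


# Hypothetical toy data: 1 = finding present / predicted present.
y_true = [1, 1, 1, 1, 0, 1, 1, 1, 1, 0]
y_pred = [1, 1, 1, 0, 0, 1, 0, 0, 0, 1]
sex = ["M", "M", "M", "M", "M", "F", "F", "F", "F", "F"]

rates = true_positive_rates(y_true, y_pred, sex)
gaps = tpr_disparities(y_true, y_pred, sex)
print(rates)
print(gaps)
```

In this toy example the model detects three of four positive cases among male patients but only one of four among female patients, so the female subgroup has a negative gap: its positive cases are missed more often. An audit along the lines the abstract recommends would repeat this per label and per protected attribute.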
