Explaining medical AI performance disparities across sites with confounder Shapley value analysis

Medical AI algorithms often experience degraded performance when evaluated on previously unseen sites. Addressing such cross-site performance disparities is key to ensuring that AI is equitable and effective when deployed on diverse patient populations. Multi-site evaluations help diagnose these disparities by testing algorithms across a broader range of potential biases, such as patient demographics, equipment types, and technical parameters; however, they do not explain why a model performs worse. Our framework quantifies the marginal and cumulative effect of each type of bias on the overall performance difference when a model is evaluated on external data. We demonstrate its usefulness in a case study of a deep learning model trained to detect the presence of pneumothorax, where the framework explains up to 60% of the cross-site performance discrepancy using known biases such as disease comorbidities and imaging parameters.
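The attribution described above can be sketched with the standard Shapley value formula applied to a set of confounders: each confounder's contribution is its average marginal effect over all subsets of the other confounders. The snippet below is a minimal illustration, not the paper's implementation; `perf_gap` is a hypothetical callable that returns the portion of the cross-site performance gap explained when a given subset of confounders is accounted for, and the confounder names are placeholders.

```python
from itertools import combinations
from math import factorial

def shapley_values(confounders, perf_gap):
    """Exact Shapley attribution of a cross-site performance gap.

    `perf_gap(subset)` is assumed to return the share of the gap
    explained when only the confounders in `subset` are accounted for
    (a hypothetical value function, not an API from the paper).
    Exact enumeration is exponential in the number of confounders,
    so this is only practical for small confounder sets.
    """
    n = len(confounders)
    values = {}
    for c in confounders:
        others = [x for x in confounders if x != c]
        total = 0.0
        for k in range(n):
            for subset in combinations(others, k):
                # Shapley weight for a coalition of size k
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                # Marginal effect of adding confounder c to this coalition
                total += weight * (perf_gap(set(subset) | {c}) - perf_gap(set(subset)))
        values[c] = total
    return values

# Toy additive example with placeholder confounders: each confounder
# independently explains a fixed share of the gap, so its Shapley value
# should recover that share exactly.
weights = {"comorbidity": 0.4, "imaging_params": 0.2}
gap = lambda s: sum(weights[c] for c in s)
vals = shapley_values(list(weights), gap)
```

By construction the Shapley values sum to `perf_gap` of the full confounder set minus `perf_gap` of the empty set, which is what makes the per-confounder contributions add up to the total explained disparity.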
