Confounding variables can degrade generalization performance of radiological deep learning models

Background: There is interest in using convolutional neural networks (CNNs) to analyze medical imaging to provide computer-aided diagnosis (CAD). Recent work has suggested that image classification CNNs may not generalize to new data as well as previously believed. We assessed how well CNNs generalized across three hospital systems for a simulated pneumonia screening task.

Methods and Findings: A cross-sectional design with multiple model training cohorts was used to evaluate model generalizability to external sites. A total of 158,323 chest radiographs were drawn from three institutions: NIH (112,120 radiographs from 30,805 patients), Mount Sinai Hospital (MSH; 42,396 radiographs from 12,904 patients), and Indiana University (IU; 3,807 radiographs from 3,683 patients). These patient populations had a mean (SD) age of 46.9 (16.6), 63.2 (16.5), and 49.6 (17) years and were 43.5%, 44.8%, and 57.1% female, respectively. We assessed individual models using the area under the receiver operating characteristic curve (AUC) for radiographic findings consistent with pneumonia and compared performance on different test sets with DeLong's test. The prevalence of pneumonia was high enough at MSH (34.2%) relative to NIH and IU (1.2% and 1.0%) that merely sorting radiographs by hospital system achieved an AUC of 0.861 on the joint MSH-NIH dataset. Models trained on data from either NIH or MSH performed on IU equivalently to an internal test set (i.e., new data from within the hospital system used for training; P = 0.580 and 0.273, respectively) but performed worse than internally on data from the other site (both P < 0.001). The highest internal performance was achieved by combining training and test data from MSH and NIH (AUC 0.931, 95% CI 0.927-0.936), but this model demonstrated significantly lower external performance at IU (AUC 0.815, 95% CI 0.745-0.885, P = 0.001). To test the effect of pooling data from sites with disparate pneumonia prevalence, we used stratified subsampling to generate MSH-NIH cohorts that differed only in disease prevalence between training data sites. When both training data sites had the same pneumonia prevalence, the model performed consistently on external IU data (P = 0.88). When a ten-fold difference in pneumonia rate was introduced between sites, internal test performance improved compared to the balanced model (MSH 10× P < 0.001; NIH 10× P = 0.002), but this outperformance failed to generalize to IU (MSH 10× P < 0.001; NIH 10× P = 0.027). CNNs were able to directly detect the hospital system of origin of a radiograph for 99.95% of NIH (22,050/22,062) and 99.98% of MSH (8,386/8,388) radiographs. The primary limitation of our approach and the available public data is that we cannot fully assess what other factors might be contributing to hospital system-specific biases.

Conclusions: Pneumonia screening CNNs achieved better internal than external performance in 3 of 5 natural comparisons. When models were trained on pooled data from sites with different pneumonia prevalence, they performed better on new pooled data from those sites but not on external data. CNNs robustly identified hospital system and department within a hospital, both of which can have large differences in disease burden and may confound disease predictions.
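As a worked illustration of the prevalence point above, the following Python sketch scores every radiograph solely by its hospital system (1 for MSH, 0 for NIH) and computes the resulting AUC from the cohort sizes and pneumonia prevalences quoted in the abstract. It uses the full cohorts rather than the paper's actual pooled test split, so it only approximates the reported 0.861; NumPy and scikit-learn are assumed to be available.

import numpy as np
from sklearn.metrics import roc_auc_score

# Cohort sizes and pneumonia prevalences as quoted in the abstract
# (full cohorts, not the authors' actual test split).
msh_n, msh_prev = 42_396, 0.342
nih_n, nih_prev = 112_120, 0.012

msh_pos = round(msh_n * msh_prev)  # ~14,499 pneumonia-positive MSH radiographs
nih_pos = round(nih_n * nih_prev)  # ~1,345 pneumonia-positive NIH radiographs

# Ground-truth labels: 1 = pneumonia, 0 = no pneumonia
y_true = np.concatenate([
    np.ones(msh_pos), np.zeros(msh_n - msh_pos),  # MSH radiographs
    np.ones(nih_pos), np.zeros(nih_n - nih_pos),  # NIH radiographs
])
# "Classifier" that only encodes hospital membership
y_score = np.concatenate([np.ones(msh_n), np.zeros(nih_n)])

print(f"AUC from hospital identity alone: {roc_auc_score(y_true, y_score):.3f}")
# Prints roughly 0.86, close to the 0.861 reported for the joint MSH-NIH data.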

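The stratified subsampling experiments can be pictured with a short sketch as well. The snippet below is only illustrative, not the authors' code: the DataFrame, its "site" and "pneumonia" columns, the cohort size, and the example prevalence values are all assumptions. It draws site-specific subsets at chosen pneumonia prevalences so that pooled MSH-NIH training cohorts differ only in the between-site prevalence ratio.

import pandas as pd

def subsample_to_prevalence(df: pd.DataFrame, target_prev: float,
                            n_total: int, seed: int = 0) -> pd.DataFrame:
    """Draw n_total radiographs from one site so that the fraction
    labeled pneumonia (0/1 column) equals target_prev."""
    n_pos = int(round(n_total * target_prev))
    pos = df[df["pneumonia"] == 1].sample(n=n_pos, random_state=seed)
    neg = df[df["pneumonia"] == 0].sample(n=n_total - n_pos, random_state=seed)
    return pd.concat([pos, neg]).sample(frac=1, random_state=seed)  # shuffle

def build_cohort(radiographs: pd.DataFrame, msh_prev: float, nih_prev: float,
                 n_per_site: int = 10_000) -> pd.DataFrame:
    """Pool equal-sized MSH and NIH subsets at the requested prevalences."""
    msh = subsample_to_prevalence(
        radiographs[radiographs["site"] == "MSH"], msh_prev, n_per_site)
    nih = subsample_to_prevalence(
        radiographs[radiographs["site"] == "NIH"], nih_prev, n_per_site)
    return pd.concat([msh, nih], ignore_index=True)

# Hypothetical usage (prevalence values are placeholders, not the paper's):
# balanced = build_cohort(radiographs, msh_prev=0.05, nih_prev=0.05)  # equal rates
# msh_10x  = build_cohort(radiographs, msh_prev=0.10, nih_prev=0.01)  # ten-fold gap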