Individual predictions matter: Assessing the effect of data ordering in training fine-tuned CNNs for medical imaging

We reproduced the results of CheXNet with fixed hyperparameters and 50 different random seeds to identify 14 finding in chest radiographs (x-rays). Because CheXNet fine-tunes a pre-trained DenseNet, the random seed affects the ordering of the batches of training data but not the initialized model weights. We found substantial variability in predictions for the same radiograph across model runs (mean ln[(maximum probability)/(minimum probability)] 2.45, coefficient of variation 0.543). This individual radiograph-level variability was not fully reflected in the variability of AUC on a large test set. Averaging predictions from 10 models reduced variability by nearly 70% (mean coefficient of variation from 0.543 to 0.169, t-test 15.96, p-value < 0.0001). We encourage researchers to be aware of the potential variability of CNNs and ensemble predictions from multiple models to minimize the effect this variability may have on the care of individual patients when these models are deployed clinically.

[1]  Oriol Vinyals,et al.  Bayesian Recurrent Neural Networks , 2017, ArXiv.

[2]  L. Sugrue,et al.  Hybrid 3D/2D Convolutional Neural Network for Hemorrhage Evaluation on Head CT , 2018, American Journal of Neuroradiology.

[3]  Ronald M. Summers,et al.  TieNet: Text-Image Embedding Network for Common Thorax Disease Classification and Reporting in Chest X-Rays , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[4]  Jon M. Kleinberg,et al.  Direct Uncertainty Prediction for Medical Second Opinions , 2018, ICML.

[5]  Kilian Q. Weinberger,et al.  Densely Connected Convolutional Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Ananthram Swami,et al.  The Limitations of Deep Learning in Adversarial Settings , 2015, 2016 IEEE European Symposium on Security and Privacy (EuroS&P).

[7]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[8]  Marcus A. Badgeley,et al.  Automated deep-neural-network surveillance of cranial images for acute neurologic events , 2018, Nature Medicine.

[9]  Xavier Robin,et al.  pROC: an open-source package for R and S+ to analyze and compare ROC curves , 2011, BMC Bioinformatics.

[10]  A. Ng,et al.  Deep learning for chest radiograph diagnosis: A retrospective comparison of the CheXNeXt algorithm to practicing radiologists , 2018, PLoS medicine.

[11]  Subhashini Venugopalan,et al.  Development and Validation of a Deep Learning Algorithm for Detection of Diabetic Retinopathy in Retinal Fundus Photographs. , 2016, JAMA.

[12]  Mario Lucic,et al.  Are GANs Created Equal? A Large-Scale Study , 2017, NeurIPS.

[13]  Jaehoon Lee,et al.  On Empirical Comparisons of Optimizers for Deep Learning , 2019, ArXiv.

[14]  Michela Paganini,et al.  The Scientific Method in the Science of Machine Learning , 2019, ArXiv.

[15]  Andrew Y. Ng,et al.  CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning , 2017, ArXiv.

[16]  Chris Dyer,et al.  On the State of the Art of Evaluation in Neural Language Models , 2017, ICLR.

[17]  Sebastian Nowozin,et al.  Can You Trust Your Model's Uncertainty? Evaluating Predictive Uncertainty Under Dataset Shift , 2019, NeurIPS.

[18]  Charles Blundell,et al.  Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles , 2016, NIPS.

[19]  Safwan S Halabi,et al.  Improving Automated Pediatric Bone Age Estimation Using Ensembles of Models from the 2017 RSNA Machine Learning Challenge. , 2019, Radiology. Artificial intelligence.

[20]  Ana Fernández-Somoano,et al.  The null hypothesis significance test in health sciences research (1995-2006): statistical analysis and interpretation , 2010, BMC medical research methodology.

[21]  Yifan Yu,et al.  CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert Comparison , 2019, AAAI.

[22]  Nathan Srebro,et al.  The Marginal Value of Adaptive Gradient Methods in Machine Learning , 2017, NIPS.

[23]  Samy Bengio,et al.  Transfusion: Understanding Transfer Learning with Applications to Medical Imaging , 2019, ArXiv.

[24]  Andrew L. Beam,et al.  Adversarial attacks on medical machine learning , 2019, Science.

[25]  Suchi Saria,et al.  Can You Trust This Prediction? Auditing Pointwise Reliability After Learning , 2019, AISTATS.

[26]  Jasper Snoek,et al.  Deep Bayesian Bandits Showdown: An Empirical Comparison of Bayesian Deep Networks for Thompson Sampling , 2018, ICLR.

[27]  Andrew Y. Ng,et al.  MURA Dataset: Towards Radiologist-Level Abnormality Detection in Musculoskeletal Radiographs , 2017, ArXiv.

[28]  Jeremy Nixon,et al.  Analyzing the role of model uncertainty for electronic health records , 2019, CHIL.

[29]  Simukayi Mutasa,et al.  MABAL: a Novel Deep-Learning Architecture for Machine-Assisted Bone Age Labeling , 2018, Journal of Digital Imaging.

[30]  Nan Wu,et al.  Deep Neural Networks Improve Radiologists’ Performance in Breast Cancer Screening , 2019, IEEE Transactions on Medical Imaging.

[31]  Roger G. Mark,et al.  MIMIC-CXR: A large publicly available database of labeled chest radiographs , 2019, ArXiv.

[32]  E. DeLong,et al.  Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. , 1988, Biometrics.

[33]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[34]  Marcus A. Badgeley,et al.  Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: A cross-sectional study , 2018, PLoS medicine.

[35]  D. Sculley,et al.  Hidden Technical Debt in Machine Learning Systems , 2015, NIPS.

[36]  Sebastian Thrun,et al.  Dermatologist-level classification of skin cancer with deep neural networks , 2017, Nature.

[37]  Philip Bachman,et al.  Deep Reinforcement Learning that Matters , 2017, AAAI.

[38]  J Carpenter,et al.  Bootstrap confidence intervals: when, which, what? A practical guide for medical statisticians. , 2000, Statistics in medicine.

[39]  Rathachai Kaewlai,et al.  Abdominal and pelvic computed tomography (CT) interpretation: discrepancy rates among experienced radiologists , 2010, European Radiology.

[40]  Anders Krogh,et al.  Learning with ensembles: How overfitting can be useful , 1995, NIPS.

[41]  Gustavo Carneiro,et al.  Detecting hip fractures with radiologist-level performance using deep neural networks , 2017, ArXiv.

[42]  Ross Ihaka,et al.  Gentleman R: R: A language for data analysis and graphics , 1996 .

[43]  H. B. Mann,et al.  On a Test of Whether one of Two Random Variables is Stochastically Larger than the Other , 1947 .

[44]  Ronald M. Summers,et al.  ChestX-ray: Hospital-Scale Chest X-ray Database and Benchmarks on Weakly Supervised Classification and Localization of Common Thorax Diseases , 2019, Deep Learning and Convolutional Neural Networks for Medical Imaging and Clinical Informatics.

[45]  Michael A. Bruno,et al.  Understanding and Confronting Our Mistakes: The Epidemiology of Error in Radiology and Strategies for Error Reduction. , 2015, Radiographics : a review publication of the Radiological Society of North America, Inc.