How I failed machine learning in medical imaging - shortcomings and recommendations

Medical imaging is an important research field with many opportunities for improving patients’ health. However, a number of challenges are slowing down the progress of the field as a whole, such as optimizing for publication. In this paper we review several problems related to choosing datasets, methods, evaluation metrics, and publication strategies. With a review of the literature and our own analysis, we show that potential biases can creep in at every step. On a positive note, we also see that initiatives to counteract these problems are already being started. Finally, we provide a broad range of recommendations on how to further address these problems in the future. For reproducibility, data and code for our analyses are available at https://github.com/GaelVaroquaux/ml_med_imaging_failures.
