Empirical assessment of bias in machine learning diagnostic test accuracy studies

OBJECTIVE Machine learning (ML) diagnostic tools have significant potential to improve health care. However, methodological pitfalls may affect diagnostic test accuracy studies used to appraise such tools. We aimed to evaluate the prevalence and reporting of design characteristics within the literature. Further, we sought to empirically assess whether design features may be associated with different estimates of diagnostic accuracy.

MATERIALS AND METHODS We systematically retrieved 2 × 2 tables (n = 281) describing the performance of ML diagnostic tools, derived from 114 publications in 38 meta-analyses, from PubMed. Data extracted included test performance, sample sizes, and design features. A mixed-effects metaregression was run to quantify the association between design features and diagnostic accuracy.

RESULTS Participant ethnicity and blinding in test interpretation were unreported in 90% and 60% of studies, respectively. Reporting was occasionally lacking for rudimentary characteristics such as study design (28% unreported). Internal validation without appropriate safeguards was used in 44% of studies. Several design features were associated with larger estimates of accuracy, including having unreported (relative diagnostic odds ratio [RDOR], 2.11; 95% confidence interval [CI], 1.43-3.1) or case-control study designs (RDOR, 1.27; 95% CI, 0.97-1.66), and recruiting participants for the index test (RDOR, 1.67; 95% CI, 1.08-2.59).

DISCUSSION Significant underreporting of experimental details was present. Study design features may affect estimates of diagnostic performance in the ML diagnostic test accuracy literature.

CONCLUSIONS The present study identifies pitfalls that threaten the validity, generalizability, and clinical value of ML diagnostic tools and provides recommendations for improvement.
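To make the reported effect measure concrete, the following is a minimal sketch of how a diagnostic odds ratio (DOR) is computed from a single 2 × 2 table and how a relative DOR (RDOR) compares two such values. It is not the mixed-effects metaregression used in the study, which pools many tables and adjusts for covariates. The function names, the 0.5 continuity correction, and all counts are illustrative assumptions rather than values from the paper.

```python
import math


def diagnostic_odds_ratio(tp, fp, fn, tn, correction=0.5):
    """DOR = (TP * TN) / (FP * FN).

    If any cell is zero, add a continuity correction to every cell
    (0.5 is a common convention, assumed here) to avoid division by zero.
    """
    cells = (tp, fp, fn, tn)
    if any(c == 0 for c in cells):
        tp, fp, fn, tn = (c + correction for c in cells)
    return (tp * tn) / (fp * fn)


def log_dor_standard_error(tp, fp, fn, tn, correction=0.5):
    """Standard error of ln(DOR): sqrt(1/TP + 1/FP + 1/FN + 1/TN)."""
    cells = (tp, fp, fn, tn)
    if any(c == 0 for c in cells):
        tp, fp, fn, tn = (c + correction for c in cells)
    return math.sqrt(1 / tp + 1 / fp + 1 / fn + 1 / tn)


def relative_dor(dor_with_feature, dor_without_feature):
    """RDOR: ratio of the DOR in studies with a design feature to the DOR
    in studies without it. A value above 1 means higher apparent accuracy
    in studies with the feature."""
    return dor_with_feature / dor_without_feature


# Hypothetical counts (TP, FP, FN, TN) for two illustrative studies.
dor_a = diagnostic_odds_ratio(90, 10, 10, 90)  # study with the design feature
dor_b = diagnostic_odds_ratio(80, 20, 20, 80)  # study without it
rdor = relative_dor(dor_a, dor_b)
```

Read this way, an RDOR of 2.11 for unreported study design means that the pooled DOR in those studies was roughly twice that of the reference group. A confidence interval that includes 1, such as 0.97-1.66 for case-control designs, is compatible with no difference.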
