A Clinician's Guide to Artificial Intelligence: How to Critically Appraise Machine Learning Studies

In recent years there has been considerable interest in machine learning models that reportedly achieve expert-level diagnostic performance across multiple disease contexts. However, there is concern that the excitement around the field has been accompanied by inadequate methodological scrutiny and insufficient adoption of good scientific practice in studies of artificial intelligence in health care. This article aims to empower clinicians and researchers to critically appraise studies of clinical applications of machine learning by: (1) introducing basic machine learning concepts and nomenclature; (2) outlining the key applicable principles of evidence-based medicine; and (3) highlighting some of the potential pitfalls in the design and reporting of such studies.
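One basic principle of evidence-based appraisal is that a model's predictive values depend on disease prevalence, so accuracy reported on an enriched, case-control style test set may not carry over to routine clinical practice. The sketch below is purely illustrative (the function name and the sensitivity, specificity, and prevalence figures are assumptions, not taken from the article); it shows how the same sensitivity and specificity can yield very different positive predictive values once prevalence changes.

    # Illustrative sketch, hypothetical numbers: how prevalence affects
    # the predictive values of a diagnostic model with fixed sensitivity
    # and specificity, a common pitfall when appraising ML studies.

    def predictive_values(sensitivity, specificity, prevalence):
        """Return (PPV, NPV) from sensitivity, specificity and prevalence."""
        tp = sensitivity * prevalence              # true positives (per unit population)
        fp = (1 - specificity) * (1 - prevalence)  # false positives
        fn = (1 - sensitivity) * prevalence        # false negatives
        tn = specificity * (1 - prevalence)        # true negatives
        return tp / (tp + fp), tn / (tn + fn)

    # A model reported with 90% sensitivity and 90% specificity,
    # evaluated first on an enriched test set, then at screening prevalence.
    for prevalence in (0.50, 0.05):
        ppv, npv = predictive_values(0.90, 0.90, prevalence)
        print(f"prevalence={prevalence:.2f}  PPV={ppv:.2f}  NPV={npv:.2f}")

With these assumed figures, the positive predictive value falls from 0.90 at 50% prevalence to roughly 0.32 at 5% prevalence, which is why reported test-set accuracy alone is insufficient for judging clinical usefulness.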
