Machine learning and artificial intelligence research for patient benefit: 20 critical questions on transparency, replicability, ethics, and effectiveness

Machine learning, artificial intelligence, and other modern statistical methods are providing new opportunities to operationalise previously untapped and rapidly growing sources of data for patient benefit. Despite much promising research currently being undertaken, particularly in imaging, the literature as a whole lacks transparency, clear reporting to facilitate replicability, exploration of potential ethical concerns, and clear demonstrations of effectiveness. Among the many reasons why these problems exist, one of the most important (for which we provide a preliminary solution here) is the current lack of best practice guidance specific to machine learning and artificial intelligence. We believe that interdisciplinary groups pursuing research and impact projects involving machine learning and artificial intelligence for health would benefit from explicitly addressing a series of questions concerning transparency, reproducibility, ethics, and effectiveness (TREE). The 20 critical questions proposed here provide a framework for research groups to inform the design, conduct, and reporting of their work; for editors and peer reviewers to evaluate contributions to the literature; and for patients, clinicians, and policy makers to critically appraise where new findings may deliver patient benefit.
