Machine Learning and AI Research for Patient Benefit: 20 Critical Questions on Transparency, Replicability, Ethics and Effectiveness

Machine learning (ML), artificial intelligence (AI) and other modern statistical methods are providing new opportunities to operationalize previously untapped and rapidly growing sources of data for patient benefit. Whilst much promising research is currently being undertaken, the literature as a whole lacks transparency, clear reporting to facilitate replicability, exploration of potential ethical concerns, and clear demonstrations of effectiveness. There are many reasons why these issues exist, but one of the most important, for which we provide a preliminary solution here, is the current lack of ML/AI-specific best practice guidance. Although there is no consensus on what best practice looks like in this field, we believe that interdisciplinary groups pursuing research and impact projects in the ML/AI for health domain would benefit from answering a series of questions based on the important issues that arise when undertaking work of this nature. Here we present 20 questions that span the entire project life cycle, from inception, data analysis, and model evaluation, to implementation, as a means to facilitate project planning and post-hoc (structured) independent evaluation. By beginning to answer these questions in different settings, we can start to understand what constitutes a good answer, and we expect that the resulting discussion will be central to developing an international consensus framework for transparent, replicable, ethical and effective research in artificial intelligence (AI-TREE) for health.
