Guidelines for clinical trial protocols for interventions involving artificial intelligence: the SPIRIT-AI Extension

Abstract: The SPIRIT 2013 (Standard Protocol Items: Recommendations for Interventional Trials) statement aims to improve the completeness of clinical trial protocol reporting by providing evidence-based recommendations for the minimum set of items to be addressed. This guidance has been instrumental in promoting transparent evaluation of new interventions. More recently, there has been growing recognition that interventions involving artificial intelligence (AI) need to undergo rigorous, prospective evaluation to demonstrate their impact on health outcomes. The SPIRIT-AI extension is a new reporting guideline for clinical trial protocols evaluating interventions with an AI component. It was developed in parallel with its companion statement for trial reports, CONSORT-AI. Both guidelines were developed through a staged consensus process: a literature review and expert consultation generated 26 candidate items, which were then put to an international multi-stakeholder group in a two-stage Delphi survey (103 stakeholders), agreed on in a consensus meeting (31 stakeholders) and refined through a checklist pilot (34 participants). The SPIRIT-AI extension includes 15 new items that were considered sufficiently important for clinical trial protocols of AI interventions. These new items should be routinely reported in addition to the core SPIRIT 2013 items. SPIRIT-AI recommends that investigators provide clear descriptions of the AI intervention, including the instructions and skills required for use, the setting in which the AI intervention will be integrated, considerations for the handling of input and output data, the human-AI interaction and the analysis of error cases. SPIRIT-AI will help promote transparency and completeness for clinical trial protocols for AI interventions. Its use will assist editors and peer reviewers, as well as the general readership, to understand, interpret and critically appraise the design and risk of bias of a planned clinical trial.

Gary S Collins | M Khair ElZarrad | David Moher | Ara Darzi | Andre Esteva | Aaron Y Lee | Hutan Ashrafian | Luke Oakden-Rayner | Christopher Yau | Jonathan J Deeks | Hugh Harvey | Charlotte Haug | Livia Faes | Pearse A Keane | Melissa McCradden | Gary Price | Adrian Jonas | An-Wen Chan | Rupa Sarkar | Melanie J Calvert | Xiaoxuan Liu | Alastair K Denniston | Cecilia S Lee | Samantha Cruz Rivera | John Fletcher | Cynthia Mulrow | Christopher Holmes | Andrew L Beam | Cyrus Espinoza | Lavinia Ferrante di Ruffano | Robert Golub | Christopher J Kelly | Elaine Manna | James Matcham | João Monteiro | Dina Paltoo | Maria Beatrice Panico | Samuel Rowley | Richard Savage | Sebastian J Vollmer
