Towards International Standards for the Evaluation of Artificial Intelligence for Health

Healthcare can benefit considerably from advanced information processing technologies, in particular from machine learning (ML) and artificial intelligence (AI). However, the health domain has so far adopted these powerful but complex innovations only hesitantly, because any technical fault can affect people's health, privacy, and consequently their entire lives. In this paper, we argue that international standards are required for thoroughly validating AI solutions for health by benchmarking their performance. Such standards might ultimately create well-founded trust in those AI solutions for which conclusive evidence of accuracy, effectiveness, and reliability has been provided. We explain why standardized benchmarking of AI solutions for health is a necessary complement to established assessment procedures. In particular, we demonstrate that it is beneficial to tackle this topic on a global scale, and we summarize the achievements of the first year of the ITU/WHO focus group on “AI for Health”, which has tasked itself with working towards these evaluation standards.
