MedPerf: Open Benchmarking Platform for Medical AI using Federated Evaluation (npj Digital Medicine, arXiv)

Medical AI has tremendous potential to advance healthcare by supporting the evidence-based practice of medicine, personalizing patient treatment, reducing costs, and improving provider and patient experience. We argue that unlocking this potential requires a systematic way to measure the performance of medical AI models on large-scale heterogeneous data. To meet this need, we are building MedPerf, an open framework for benchmarking machine learning in the medical domain. MedPerf will enable federated evaluation, in which models are securely distributed to different facilities and evaluated there, empowering healthcare organizations to assess and verify the performance of AI models in an efficient, human-supervised process while prioritizing privacy. We describe the current challenges that the healthcare and AI communities face, the need for an open platform, the design philosophy of MedPerf, its current implementation status, and our roadmap. We call for researchers and organizations to join us in creating the MedPerf open benchmarking platform. Code availability: all code is available under an Apache license at https://github.com/mlcommons
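
The federated evaluation idea described above can be illustrated with a minimal Python sketch. It is not MedPerf's API: the names `Site`, `federated_evaluate`, and `threshold_model`, as well as the toy data, are hypothetical and chosen only for illustration. The sketch assumes each facility runs the model on its own private records and reports back only aggregate metrics, so raw data never leaves the site.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical types for illustration; none of these names come from MedPerf.
Model = Callable[[float], int]  # maps one feature value to a predicted label


@dataclass
class Site:
    """A facility that holds private (feature, label) records."""
    name: str
    _records: Sequence[Tuple[float, int]]

    def evaluate(self, model: Model) -> Dict[str, float]:
        # The model runs locally; only aggregate metrics are returned.
        correct = sum(1 for x, y in self._records if model(x) == y)
        n = len(self._records)
        return {"n": n, "accuracy": correct / n}


def federated_evaluate(model: Model, sites: List[Site]) -> dict:
    """Send the model to every site and combine the returned metrics."""
    reports = {site.name: site.evaluate(model) for site in sites}
    total = sum(r["n"] for r in reports.values())
    # Pooled accuracy is the sample-size-weighted mean of per-site accuracy.
    pooled = sum(r["accuracy"] * r["n"] for r in reports.values()) / total
    return {"per_site": reports, "pooled_accuracy": pooled}


def threshold_model(x: float) -> int:
    # Toy stand-in for a trained model.
    return int(x >= 0.5)


sites = [
    Site("hospital_a", [(0.9, 1), (0.2, 0), (0.6, 0), (0.4, 0)]),
    Site("hospital_b", [(0.7, 1), (0.1, 0)]),
]
result = federated_evaluate(threshold_model, sites)
print(result["per_site"])
print(result["pooled_accuracy"])
```

Reporting per-site metrics alongside the pooled figure keeps differences between facilities visible, which matters when the data are heterogeneous. A real deployment would also need secure model distribution and a human approval step before results are released, both of which this sketch omits.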
