BioModelsML: Building a FAIR and reproducible collection of machine learning models in life sciences and medicine for easy reuse

Machine learning (ML) models are widely used in life sciences and medicine; however, they are scattered across various platforms, and several challenges hinder their accessibility, reproducibility and reuse. In this manuscript, we present the formalisation and pilot implementation of a community protocol to enable FAIReR (Findable, Accessible, Interoperable, Reusable, and Reproducible) sharing of ML models. The protocol consists of eight steps: sharing model training code, dataset information, reproduced figures, model evaluation metrics, trained models, Dockerfiles, and model metadata, followed by FAIR dissemination. By applying these measures, we aim to build and share a comprehensive public collection of FAIR ML models in the BioModels repository through incentivised community curation. In a pilot implementation, we curated diverse ML models to demonstrate the feasibility of our approach, and we discuss the current challenges. Building a FAIReR collection of ML models will directly enhance their reproducibility and reusability, minimising the effort needed to reimplement models, maximising their impact in applications, and accelerating progress in the life sciences and medicine.
