COPO: a metadata platform for brokering FAIR data in the life sciences

Scientific innovation is increasingly reliant on data and computational resources. Much of today’s life science research involves generating, processing, and reusing heterogeneous datasets that are growing exponentially in size. Demand for technical experts (data scientists and bioinformaticians) to process these data is at an all-time high, but these experts are not typically trained in good data management practices. That said, we have come a long way in the last decade, with funders, publishers, and researchers themselves making the case for open, interoperable data as a key component of an open science philosophy. In response, recognition of the FAIR Principles (that data should be Findable, Accessible, Interoperable and Reusable) has become commonplace. However, both technical and cultural challenges to implementing these principles remain when storing, managing, analysing and disseminating legacy and new data. COPO is a computational system that attempts to address some of these challenges by enabling scientists to describe their research objects (raw or processed data, publications, samples, images, etc.) using community-sanctioned metadata sets and vocabularies, and then to share them with the wider scientific community through public or institutional repositories. COPO encourages data generators to adhere to appropriate metadata standards when publishing research objects, using semantic terms to add meaning to them and to specify relationships between them. This allows data consumers, be they people or machines, to find, aggregate, and analyse data which would otherwise be private or invisible. COPO builds upon existing standards to push the state of the art in scientific data dissemination whilst minimising the burden of data publication and sharing.
