The role of metadata in reproducible computational research

Summary: Reproducible computational research (RCR) is the keystone of the scientific method for in silico analyses, packaging the transformation of raw data into published results. In addition to its role in research integrity, improving the reproducibility of scientific studies can accelerate evaluation and reuse. This potential, together with wide support for the FAIR principles, has motivated interest in metadata standards that support reproducibility. Metadata provide context and provenance for raw data and methods and are essential to both discovery and validation. Despite this close connection to scientific data, few studies have explicitly described how metadata enable reproducible computational research. This review employs a functional content analysis to identify metadata standards that support reproducibility across an analytic stack consisting of input data, tools, notebooks, pipelines, and publications. Our review provides background context, explores gaps, and identifies trends in embeddedness and methodology weight across these components, from which we derive recommendations for future work.
