A taxonomy for reproducible and replicable research in environmental modelling

Abstract: Despite growing acknowledgment of a reproducibility crisis in computational science, it remains unclear what exactly constitutes a reproducible or replicable study in many computational fields, including environmental modelling. To address this, we put forth a taxonomy that defines an environmental modelling study as being either 1) repeatable, 2) runnable, 3) reproducible, or 4) replicable. We introduce these terms with illustrative examples from hydrology, using a hydrologic modelling framework along with cyberinfrastructure aimed at fostering reproducibility. Using this taxonomy as a guide, we argue that containerization is an important but still missing component needed to achieve computational reproducibility in hydrology and environmental modelling. The hydrology examples demonstrate how new tools, including Sciunit, a user-friendly tool for containerizing computational analyses, can lower the barrier to reproducibility and replicability in the environmental modelling community.
