Towards Long-term and Archivable Reproducibility

Reproducible workflow solutions commonly rely on high-level technologies that were popular when they were created. These provide an immediate solution, but one that is unlikely to be sustainable in the long term. We therefore introduce a set of criteria to address this problem and demonstrate their practicality and implementation. The criteria have been tested in several research publications and can be summarized as: completeness (no dependency beyond a POSIX-compatible operating system, no administrator privileges, no network connection, and storage primarily in plain text); modular design; minimal complexity; scalability; verifiable inputs and outputs; temporal provenance; linking analysis with narrative; and free and open-source software. As a proof of concept, we have implemented "Maneage", a solution that stores the project in machine-actionable and human-readable plain text, enabling version control, cheap archiving, automatic parsing to extract data provenance, and peer-reviewable verification. We show that requiring longevity of a reproducible workflow solution is realistic without sacrificing immediate or short-term reproducibility, and we discuss the benefits of the criteria for scientific progress. This paper has itself been written in Maneage, with snapshot 1637cce.
