RepeatFS: a file system providing reproducibility through provenance and automation

MOTIVATION Reproducibility is of central importance to the scientific process. The difficulty of consistently replicating and verifying experimental results is magnified in the era of big data, in which bioinformatics analysis often involves complex multi-application pipelines operating on terabytes of data. These processes result in thousands of possible permutations of data preparation steps, software versions and command-line arguments. Existing reproducibility frameworks are cumbersome and involve redesigning computational methods. To address these issues, we developed RepeatFS, a file system that records, replicates and verifies informatics workflows with no alteration to the original methods. RepeatFS also provides several other features to help promote analytical transparency and reproducibility, including provenance visualization and task automation. RESULTS We used RepeatFS to successfully visualize and replicate a variety of bioinformatics tasks consisting of over a million operations with no alteration to the original methods. RepeatFS correctly identified all software inconsistencies that resulted in replication differences. AVAILABILITYAND IMPLEMENTATION RepeatFS is implemented in Python 3. Its source code and documentation are available at https://github.com/ToniWestbrook/repeatfs. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.

[1]  Richard O. Sinnott,et al.  Investigating reproducibility and tracking provenance – A genomic workflow case study , 2017, BMC Bioinformatics.

[2]  J. Thompson,et al.  CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. , 1994, Nucleic acids research.

[3]  Willian A. da Silveira,et al.  Genomics pipelines and data integration: challenges and opportunities in the research setting , 2017, Expert review of molecular diagnostics.

[4]  Hideaki Sugawara,et al.  The Sequence Read Archive , 2010, Nucleic Acids Res..

[5]  Sven Rahmann,et al.  Snakemake--a scalable bioinformatics workflow engine. , 2012, Bioinformatics.

[6]  Bartek Wilczynski,et al.  Biopython: freely available Python tools for computational molecular biology and bioinformatics , 2009, Bioinform..

[7]  Leo Goodstadt,et al.  Ruffus: a lightweight Python library for computational pipelines , 2010, Bioinform..

[8]  E. Myers,et al.  Basic local alignment search tool. , 1990, Journal of molecular biology.

[9]  Francesco Asnicar,et al.  Reproducible, interactive, scalable and extensible microbiome data science using QIIME 2 , 2019, Nature Biotechnology.

[10]  Yolanda Gil,et al.  Quantifying Reproducibility in Computational Biology: The Case of the Tuberculosis Drugome , 2013, PloS one.

[11]  K. Katoh,et al.  MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. , 2002, Nucleic acids research.

[12]  Björn Usadel,et al.  Trimmomatic: a flexible trimmer for Illumina sequence data , 2014, Bioinform..

[13]  Sven Rahmann,et al.  Genome analysis , 2022 .

[14]  Torsten Seemann,et al.  Prokka: rapid prokaryotic genome annotation , 2014, Bioinform..

[15]  Jonathan Cooper,et al.  Where next for the reproducibility agenda in computational biology? , 2016, BMC Systems Biology.

[16]  M. Baker 1,500 scientists lift the lid on reproducibility , 2016, Nature.

[17]  Sergey I. Nikolenko,et al.  SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing , 2012, J. Comput. Biol..

[18]  Pelin Yilmaz,et al.  The SILVA ribosomal RNA gene database project: improved data processing and web-based tools , 2012, Nucleic Acids Res..

[19]  Alexandros Stamatakis,et al.  RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies , 2014, Bioinform..

[20]  Jean-Baptiste Poline,et al.  Experimenting with reproducibility: a case study of robustness in bioinformatics , 2018, GigaScience.

[21]  Gonçalo R. Abecasis,et al.  The Sequence Alignment/Map format and SAMtools , 2009, Bioinform..

[22]  Bernard J. Pope,et al.  Bpipe: a tool for running and managing bioinformatics pipelines , 2012, Bioinform..

[23]  Bo Peng,et al.  Script of Scripts: A pragmatic workflow system for daily computational research , 2019, PLoS Comput. Biol..

[24]  Farah Magrabi,et al.  Does health informatics have a replication crisis? , 2018, J. Am. Medical Informatics Assoc..