Using rapid prototyping to choose a bioinformatics workflow management system

Workflow management systems represent, manage, and execute multi-step computational analyses and offer many benefits to bioinformaticians. They provide a common language for describing analysis workflows, contributing to reproducibility and to building libraries of reusable components. They can support both incremental builds and re-entrancy: the ability to selectively re-execute parts of a workflow when inputs are added or configuration changes, and to resume execution from where a workflow previously stopped. Many workflow management systems enhance portability by supporting the use of containers, high-performance computing systems and clouds. Most importantly, workflow management systems allow bioinformaticians to delegate how their workflows are run to the workflow management system and its developers. This frees the bioinformaticians to focus on the content of these workflows, their data analyses, and their science. RiboViz is a package to extract biological insight from ribosome profiling data to help advance understanding of protein synthesis. At the heart of RiboViz is an analysis workflow, implemented in a Python script. To conform to best practices for scientific computing, which recommend using build tools to automate workflows and re-using code instead of rewriting it, the authors reimplemented this workflow within a workflow management system. To select a workflow management system, a rapid survey of available systems was undertaken, and candidates were shortlisted: Snakemake; cwltool and Toil (both implementations of the Common Workflow Language); and Nextflow. Each candidate was evaluated via rapid prototyping of a subset of the RiboViz workflow, and Nextflow was chosen. The selection process took 10 person-days, a small cost for the assurance that Nextflow best satisfied the authors’ requirements.
This use of rapid prototyping can offer a low-cost way of making a more informed selection of software to use within projects, rather than relying solely upon reviews and recommendations by others.

Author summary

Data analysis involves many steps, as data are wrangled, processed, and analysed using a succession of unrelated software packages. Running all the right steps, in the right order, with the right outputs in the right places is a major source of frustration. Workflow management systems require that each data analysis step be “wrapped” in a structured way, describing its inputs, parameters, and outputs. By writing these wrappers the scientist can focus on the meaning of each step, which is the interesting part. The system uses these wrappers to decide which steps to run and how to run them, and takes charge of running the steps, including reporting on errors. This makes it much easier to run the analysis repeatedly and to run it transparently on different computers. To select a workflow management system, we surveyed available tools and selected three for “rapid prototype” implementations to evaluate their suitability for our project. We advocate this rapid prototyping as a low-cost way, in both time and effort, of making an informed selection of a system for use within a project. We conclude that many similar multi-step data analysis workflows can be rewritten in a workflow management system.
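
The wrapper idea described above can be illustrated with a minimal Python sketch. It is a toy under stated assumptions, not how Snakemake, Nextflow, or the Common Workflow Language tools are implemented: the names `Step`, `is_stale`, and `run_workflow` are hypothetical, steps are assumed to be listed in dependency order, and staleness is judged only by comparing file modification times.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, List


@dataclass
class Step:
    """Hypothetical wrapper for one analysis step: its inputs, outputs, and action."""
    name: str
    inputs: List[Path]
    outputs: List[Path]
    action: Callable[[List[Path], List[Path]], None]


def is_stale(step: Step) -> bool:
    """Return True if the step needs to run.

    A step is stale if it declares no outputs, if any output is missing,
    or if its oldest output is older than its newest input.
    A missing input raises FileNotFoundError, which surfaces the error.
    """
    if not step.outputs or not all(p.exists() for p in step.outputs):
        return True
    newest_input = max((p.stat().st_mtime for p in step.inputs), default=0.0)
    oldest_output = min(p.stat().st_mtime for p in step.outputs)
    return oldest_output < newest_input


def run_workflow(steps: List[Step]) -> List[str]:
    """Run only the stale steps, in the listed order, and return their names.

    Assumption: `steps` is already sorted so that each step appears after
    the steps that produce its inputs.
    """
    executed = []
    for step in steps:
        if is_stale(step):
            step.action(step.inputs, step.outputs)
            executed.append(step.name)
    return executed
```

Running `run_workflow` a second time without changing any input executes nothing, which is the incremental-build behaviour described in the abstract. Real workflow management systems add much more on top of this, such as dependency resolution, containers, and cluster or cloud execution.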
