ZARP: A user-friendly and versatile RNA-seq analysis workflow

Background RNA sequencing (RNA-seq) is a widely used technique in many scientific studies. Given the plethora of models and software packages that have been developed for processing and analyzing RNA-seq datasets, choosing the most appropriate ones is a time-consuming process that requires an in-depth understanding of the data, as well as of the principles and parameters of each tool. In addition, packages designed for individual tasks are developed in different programming languages and have dependencies of varying complexity, which makes their installation and execution challenging for users with limited computational expertise. Workflow languages and execution engines with support for virtualization and encapsulation options such as containers and Conda environments facilitate these tasks considerably. The resulting computational workflows can then be reliably shared with the scientific community, enhancing reusability and the reproducibility of results as individual analysis steps become more transparent and portable.

Methods Here we present ZARP, a general-purpose RNA-seq analysis workflow that builds on state-of-the-art software in the field to facilitate the analysis of RNA-seq datasets. ZARP is developed in the Snakemake workflow language and can run locally or in a cluster environment, generating extensive reports not only of the data but also of the options used. It is built using modern technologies with the ultimate goal of reducing the hands-on time for bioinformaticians and non-expert users, and of serving as a template for future workflow development. To this end, we also provide ZARP-cli, a dedicated command-line interface that may make running ZARP on an RNA-seq library of interest as easy as executing a single two-word command.

Conclusions ZARP is a powerful RNA-seq analysis workflow that is easy to use even for beginners, built using best software development practices, available under a permissive Open Source license and open to contributions by the scientific community.
