uap: reproducible and robust HTS data analysis

Background: A lack of reproducibility has been repeatedly criticized in computational research. High throughput sequencing (HTS) data analysis is a complex multi-step process. For most of the steps a range of bioinformatic tools is available, and for most tools numerous parameters need to be set. Due to this complexity, HTS data analysis is particularly prone to reproducibility and consistency issues. We have defined four criteria that, in our opinion, ensure a minimal degree of reproducible research for HTS data analysis. Several workflow management systems are available to assist with complex multi-step data analyses. However, to the best of our knowledge, none of the currently available workflow management systems satisfies all four criteria for reproducible HTS analysis.

Results: Here we present uap, a workflow management system dedicated to robust, consistent, and reproducible HTS data analysis. uap is optimized for application to omics data, but can be easily extended to other complex analyses. It is available under the GNU GPL v3 license at https://github.com/yigbt/uap.

Conclusions: uap is a freely available tool that enables researchers to easily adhere to reproducible research principles for HTS data analyses.
