uap: reproducible and robust HTS data analysis

Background: Computational research has been repeatedly criticized for a lack of reproducibility. High-throughput sequencing (HTS) data analysis is a complex multi-step process. For most steps a range of bioinformatic tools is available, and most tools require numerous parameters to be set. Due to this complexity, HTS data analysis is particularly prone to reproducibility and consistency issues. We have defined four criteria that, in our opinion, ensure a minimal degree of reproducible research for HTS data analysis. A number of workflow management systems are available to assist with complex multi-step data analyses. However, to the best of our knowledge, none of the currently available workflow management systems satisfies all four criteria for reproducible HTS analysis.

Results: Here we present uap, a workflow management system dedicated to robust, consistent, and reproducible HTS data analysis. uap is optimized for omics data but can be easily extended to other complex analyses. It is available under the GNU GPL v3 license at https://github.com/yigbt/uap.

Conclusions: uap is a freely available tool that enables researchers to easily adhere to reproducible research principles for HTS data analyses.
