Building containerized workflows using the BioDepot-workflow-builder (Bwb)

We present the BioDepot-workflow-builder (Bwb), a portable, open-source tool for creating bioinformatics workflows through a simple drag-and-drop graphical user interface. The individual components of a workflow are Docker containers, which are either available from public repositories or provided by the user. Using software containers ensures that workflows give identical results across different operating systems and hardware architectures. Using Docker also allows individual components to be deployed on the cloud. Because Bwb is modular and makes bioinformatics tools easy to customize and install, researchers can efficiently test new workflows and compare competing algorithms. Bwb itself is packaged in a Docker container, so setup is minimal: users only need to install Docker and have access to a web browser to begin creating and running workflows. As a proof-of-concept case study, we demonstrate the feasibility of Bwb by developing widgets for the RNA-seq differential expression analysis workflow employed by the NIH BD2K-LINCS Drug Toxicity Signature Generation Center at Mount Sinai. The app and all the containers are available on the BioDepot repository (https://hub.docker.com/r/biodepot).
