CWL-Airflow: a lightweight pipeline manager supporting Common Workflow Language

Background: Massive growth in the amount of research data and computational analysis has led to increased use of pipeline managers in biomedical computational research. However, each of the more than 100 such managers describes pipelines in its own way, which makes workflows difficult to port between environments and therefore undermines the reproducibility of computational studies. For this reason, the Common Workflow Language (CWL) was recently introduced as a specification for platform-independent workflow description, and work began to transition existing pipelines and workflow managers to CWL.

Findings: Here, we present CWL-Airflow, an extension for the Apache Airflow pipeline manager that supports CWL. CWL-Airflow uses the CWL v1.0 specification and can run workflows on standalone macOS/Linux servers, on clusters, or on a variety of cloud platforms. A sample CWL pipeline for processing ChIP-Seq data is provided.

Conclusions: CWL-Airflow provides users with the features of a fully fledged pipeline manager and the ability to execute CWL workflows anywhere Airflow can run, from a laptop to a cluster or cloud environment.

Availability: CWL-Airflow is available under the Apache License, version 2.0, and can be downloaded from https://barski-lab.github.io/cwl-airflow, http://doi.org/10.5281/zenodo.2669582, RRID: SCR_017196.
