CGAT-core: a Python framework for building scalable, reproducible computational biology workflows

In the genomics era, computational biologists regularly need to process, analyse and integrate large and complex biomedical datasets. Analysis inevitably involves multiple dependent steps, resulting in complex pipelines or workflows, often with several branches. Large data volumes mean that processing needs to be quick and efficient, and scientific rigour requires that analysis be consistent and fully reproducible. We have developed CGAT-core, a Python package for the rapid construction of complex computational workflows. CGAT-core seamlessly handles parallelisation across high-performance computing clusters, integration of Conda environments, full parameterisation, database integration and logging. To illustrate our workflow framework, we present a pipeline for the analysis of RNA-seq data using pseudo-alignment.
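
The idea of a workflow built from dependent steps with several branches can be illustrated with a minimal sketch. It uses only the Python standard library (`graphlib`, Python 3.9 or later) and does not use or represent the CGAT-core API. The task names and the helper `execution_batches` are hypothetical, chosen to resemble a pseudo-alignment RNA-seq analysis. Tasks are grouped into batches in which every task has all of its dependencies satisfied, so the tasks in one batch could in principle be submitted to a cluster in parallel.

```python
from graphlib import TopologicalSorter

# Hypothetical task graph (an assumption for illustration only).
# Each key is a task; its value is the set of tasks it depends on.
TASKS = {
    "build_index": set(),
    "quantify_sample_a": {"build_index"},
    "quantify_sample_b": {"build_index"},
    "merge_counts": {"quantify_sample_a", "quantify_sample_b"},
    "differential_expression": {"merge_counts"},
}


def execution_batches(tasks):
    """Group tasks into ordered batches.

    Tasks in the same batch do not depend on each other, so a
    scheduler could run them at the same time.
    """
    sorter = TopologicalSorter(tasks)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    batches = []
    while sorter.is_active():
        # Sort only to make the output deterministic.
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)  # mark the batch complete to unlock successors
    return batches


if __name__ == "__main__":
    for number, batch in enumerate(execution_batches(TASKS), start=1):
        print(f"batch {number}: {', '.join(batch)}")
```

In this sketch the two quantification tasks fall into the same batch because both depend only on the index. A workflow framework automates this kind of independence detection when it parallelises a pipeline.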
