Scalable Workflows and Reproducible Data Analysis for Genomics

Biological, clinical, and pharmacological research now often involves analyses of genomes, transcriptomes, proteomes, and interactomes, within and between individuals and across species. Because of the large data volumes involved, analyzing and integrating the data generated by such high-throughput technologies has become computationally intensive, and the analysis can no longer be run on a typical desktop computer.

In this chapter we show how to describe and execute the same analysis using a number of workflow systems, and how these systems take different approaches to execution and reproducibility. We show how any researcher can create a reusable and reproducible bioinformatics pipeline that can be deployed and run anywhere. We show how to create a scalable, reusable, and shareable workflow using four different workflow engines, each of which supports parallel execution: the Common Workflow Language (CWL), Guix Workflow Language (GWL), Snakemake, and Nextflow.

We show how to bundle a number of tools used in evolutionary biology using the Debian, GNU Guix, and Bioconda software distributions, together with container systems such as Docker, GNU Guix, and Singularity. Together these distributions provide the large majority of software packages relevant for biology, including PAML, Muscle, MAFFT, MrBayes, and BLAST. Software bundled in lightweight containers can be deployed on a desktop, in the cloud, and, increasingly, on compute clusters.

By bundling software through these public software distributions, and by creating reproducible and shareable pipelines with these workflow engines, bioinformaticians not only spend less time reinventing the wheel but also get closer to the ideal of making science reproducible. The examples in this chapter allow a quick comparison of the different solutions.
