Accelerating Comparative Genomics Workflows in a Distributed Environment with Optimized Data Partitioning

The advent of new sequencing technology has generated massive amounts of biological data at unprecedented rates. High-throughput bioinformatics tools are required to keep pace with this. Here, we implement a workflow-based model for parallelizing the data intensive task of genome alignment and variant calling with BWA and GATK's Haplotype Caller. We explore different approaches of partitioning data and how each affect the run time. We observe granularity-based partitioning for BWA and alignment-based partitioning for Halo type Caller to be the optimal choices for the pipeline. We identify the various challenges encountered while developing such an application and provide an insight into addressing them. We report significant performance improvements, from 12 days to 4 hours, while running the BWA-GATK pipeline using 100 nodes for analyzing high-coverage oak tree data.

[1]  J. Reis-Filho Next-generation sequencing , 2009, Breast Cancer Research.

[2]  Ken Chen,et al.  VarScan: variant detection in massively parallel sequencing of individual and pooled samples , 2009, Bioinform..

[3]  M. Metzker Sequencing technologies — the next generation , 2010, Nature Reviews Genetics.

[4]  M. DePristo,et al.  A framework for variation discovery and genotyping using next-generation DNA sequencing data , 2011, Nature Genetics.

[5]  Heng Li,et al.  A survey of sequence alignment algorithms for next-generation sequencing , 2010, Briefings Bioinform..

[6]  R. Durbin,et al.  Mapping Quality Scores Mapping Short Dna Sequencing Reads and Calling Variants Using P

, 2022 .

[7]  Miron Livny,et al.  Condor-a hunter of idle workstations , 1988, [1988] Proceedings. The 8th International Conference on Distributed.

[8]  Ruiqiang Li,et al.  SOAP: short oligonucleotide alignment program , 2008, Bioinform..

[9]  Amy E. Hawkins,et al.  DNA sequencing of a cytogenetically normal acute myeloid leukemia genome , 2008, Nature.

[10]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[11]  Richard Durbin,et al.  Sequence analysis Fast and accurate short read alignment with Burrows – Wheeler transform , 2009 .

[12]  Joshua S. Paul,et al.  Genotype and SNP calling from next-generation sequencing data , 2011, Nature Reviews Genetics.

[13]  Gonçalo R. Abecasis,et al.  The Sequence Alignment/Map format and SAMtools , 2009, Bioinform..

[14]  Timothy B. Stockwell,et al.  Evaluation of next generation sequencing platforms for population targeted sequencing studies , 2009, Genome Biology.

[15]  Michael C. Schatz,et al.  CloudBurst: highly sensitive read mapping with MapReduce , 2009, Bioinform..

[16]  J. Thompson,et al.  CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. , 1994, Nucleic acids research.

[17]  Wu-chun Feng,et al.  The design, implementation, and evaluation of mpiBLAST , 2003 .

[18]  Gonçalo R. Abecasis,et al.  The variant call format and VCFtools , 2011, Bioinform..

[19]  Douglas Thain,et al.  Harnessing parallelism in multicore clusters with the All-Pairs, Wavefront, and Makeflow abstractions , 2010, Cluster Computing.

[20]  Li Yi,et al.  Harnessing parallelism in multicore clusters with the all-pairs and wavefront abstractions , 2009, HPDC '09.

[21]  Peter M. Rice,et al.  The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants , 2009, Nucleic acids research.

[22]  Douglas Thain,et al.  All-pairs: An abstraction for data-intensive cloud computing , 2008, 2008 IEEE International Symposium on Parallel and Distributed Processing.

[23]  Siu-Ming Yiu,et al.  SOAP2: an improved ultrafast tool for short read alignment , 2009, Bioinform..

[24]  M. Schatz,et al.  Searching for SNPs with cloud computing , 2009, Genome Biology.

[25]  Alekh Jindal,et al.  Hadoop++ , 2010 .

[26]  Steven L Salzberg,et al.  Fast gapped-read alignment with Bowtie 2 , 2012, Nature Methods.

[27]  Douglas Thain,et al.  Making work queue cluster-friendly for data intensive scientific applications , 2013, 2013 IEEE International Conference on Cluster Computing (CLUSTER).

[29]  A. Amores,et al.  Rapid and cost-effective polymorphism identification and genotyping using restriction site associated DNA (RAD) markers. , 2007, Genome research.

[30]  Huanming Yang,et al.  SNP detection for massively parallel whole-genome resequencing. , 2009, Genome research.

[31]  Douglas Thain,et al.  Adapting bioinformatics applications for heterogeneous systems: a case study , 2011, ECMLS '11.

[32]  M. Morris,et al.  The Design , 1998 .

[33]  Cole Trapnell,et al.  Ultrafast and memory-efficient alignment of short DNA sequences to the human genome , 2009, Genome Biology.

[34]  Simon Handley,et al.  On the use of a directed acyclic graph to represent a population of computer programs , 1994, Proceedings of the First IEEE Conference on Evolutionary Computation. IEEE World Congress on Computational Intelligence.

[35]  D. J. Wheeler,et al.  A Block-sorting Lossless Data Compression Algorithm , 1994 .

[36]  Gianluigi Zanetti,et al.  SEAL: a distributed short read mapping and duplicate removal tool , 2011, Bioinform..

[37]  Douglas Thain,et al.  Work Queue + Python: A Framework For Scalable Scientific Ensemble Applications , 2011 .

[38]  Wolfgang Gentzsch,et al.  Sun Grid Engine: towards creating a compute power grid , 2001, Proceedings First IEEE/ACM International Symposium on Cluster Computing and the Grid.

[39]  Kuo-Bin Li,et al.  ClustalW-MPI: ClustalW analysis using distributed and parallel computing , 2003, Bioinform..