A highly parallel next-generation DNA sequencing data analysis pipeline in Hadoop

The era of precision medicine is best exemplified by the growing reliance on next-generation sequencing (NGS) technologies to provide improved disease diagnosis and targeted therapeutic selection. Well-established NGS data analysis software tools, in their unmodified form, can take days to identify and interpret single nucleotide and structural variations in DNA for a single patient. To improve sample analysis throughput, we developed a highly parallel end-to-end next-generation DNA sequencing data analysis pipeline in Hadoop. In our pipeline, each step is parallelized not only across samples but also within each individual sample, achieving a 30× speedup over a single-server workflow execution. Furthermore, we extensively evaluate the viability of offering our Hadoop-based pipeline as part of a larger commercial genomic services offering: we demonstrate that execution time scales sub-linearly both with the number of samples being analyzed and with the depth of coverage of those samples. In particular, on our commodity cluster, 10× as many samples resulted in only a 2.24× increase in execution time, and a 4× increase in coverage depth resulted in only a 2.53× increase in execution time. We anticipate that such improvements will allow large cohort populations to be analyzed in parallel, and can fundamentally change the way DNA sequencing analyses are used by both researchers and clinicians.
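
To make the sub-linear scaling figures concrete, the short Python sketch below converts the two reported ratios into an effective scaling exponent and a relative per-unit cost. It assumes, purely for illustration, that execution time follows a power law T ∝ n^k in the scaled quantity; that model and the helper names `scaling_exponent` and `relative_unit_cost` are assumptions of this sketch, not something the abstract states.

```python
import math


def scaling_exponent(input_factor: float, time_factor: float) -> float:
    """Exponent k such that time_factor == input_factor ** k.

    Assumes a power-law model (an illustrative assumption, not the
    authors' stated model). k < 1 means sub-linear scaling.
    """
    return math.log(time_factor) / math.log(input_factor)


def relative_unit_cost(input_factor: float, time_factor: float) -> float:
    """Execution time per unit of input, relative to the baseline run."""
    return time_factor / input_factor


# Ratios reported in the abstract.
samples_k = scaling_exponent(10, 2.24)    # 10x samples  -> 2.24x time
coverage_k = scaling_exponent(4, 2.53)    # 4x coverage  -> 2.53x time

print(round(samples_k, 2))                     # 0.35
print(round(coverage_k, 2))                    # 0.67
print(round(relative_unit_cost(10, 2.24), 3))  # 0.224
```

Under this assumed model, both exponents are below 1, which is consistent with the sub-linear scaling claimed above. Each exponent comes from a single pair of measurements, so it is a rough summary rather than a fitted trend.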
