Paper information - Cluster-based Apache Spark implementation of the GATK DNA analysis pipeline

Cluster-based Apache Spark implementation of the GATK DNA analysis pipeline

Rapid progress in next-generation sequencing has dramatically increased the throughput of DNA sequencing, resulting in the availability of large DNA data sets ready for analysis. However, post-sequencing DNA analysis has become the bottleneck in using these data sets, as it requires powerful and scalable tools. A typical analysis pipeline consists of a number of steps, not all of which scale readily on a distributed computing infrastructure. Recently, tools like Halvade, a Hadoop MapReduce solution, and Churchill, an HPC cluster-based solution, have addressed this scalability issue in the GATK DNA analysis pipeline. In this paper, we present a framework that implements an in-memory distributed version of the GATK pipeline using Apache Spark. Our framework reduces execution time by keeping data in memory between the map and reduce steps. In addition, it has a dynamic load balancing algorithm that makes better use of system resources by drawing on runtime statistics of the active workload. Experiments on a 4-node cluster with 64 virtual cores show that this approach is 63% faster than a Hadoop MapReduce-based solution.
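The abstract does not say how the dynamic load balancing algorithm works, so the following is only a minimal sketch of the general idea of balancing work using measured workload statistics. It assumes that per-region read counts are available as the runtime statistic and uses a generic greedy heuristic (heaviest region first, assigned to the least-loaded worker). The function names `balance_regions` and `worker_loads`, the region labels, and the counts are all hypothetical and are not taken from the paper.

```python
import heapq


def balance_regions(read_counts, num_workers):
    """Assign regions to workers with a greedy heaviest-first heuristic.

    This is a generic illustration, not the algorithm from the paper.
    read_counts maps a region label to its (hypothetical) number of reads.
    """
    # Min-heap of (current load, worker index): the least-loaded worker pops first.
    heap = [(0, worker) for worker in range(num_workers)]
    heapq.heapify(heap)
    assignment = {worker: [] for worker in range(num_workers)}

    # Place the heaviest regions first so that small ones can even out the loads.
    by_weight = sorted(read_counts.items(), key=lambda kv: kv[1], reverse=True)
    for region, count in by_weight:
        load, worker = heapq.heappop(heap)
        assignment[worker].append(region)
        heapq.heappush(heap, (load + count, worker))
    return assignment


def worker_loads(read_counts, assignment):
    """Return the total number of reads assigned to each worker."""
    return {
        worker: sum(read_counts[region] for region in regions)
        for worker, regions in assignment.items()
    }


# Hypothetical read counts per region, for illustration only.
counts = {"chr1": 90, "chr2": 70, "chr3": 60, "chr4": 40, "chr5": 30, "chr6": 10}
plan = balance_regions(counts, num_workers=3)
loads = worker_loads(counts, plan)
```

With these illustrative counts, each of the three workers ends up with the same total load. A real pipeline would collect the statistics while the job is running and would likely also split oversized regions, which this sketch does not attempt.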

Hamid Mushtaq | Zaid Al-Ars

[1] Peter White, et al. Churchill: an ultra-fast, deterministic, highly scalable and balanced parallelization strategy for the discovery of human genetic variation in clinical and population-scale genomics, 2015, Genome Biology.

[2] Walter L. Ruzzo, et al. Compression of next-generation sequencing reads aided by highly efficient de novo assembly, 2012, Nucleic Acids Research.

[3] Michael R. Speicher, et al. A survey of tools for variant analysis of next-generation genome sequencing data, 2013, Briefings Bioinform.

[4] Mauricio O. Carneiro, et al. From FastQ Data to High-Confidence Variant Calls: The Genome Analysis Toolkit Best Practices Pipeline, 2013, Current Protocols in Bioinformatics.

[5] 김동규, et al. [Book review] "Algorithms on Strings, Trees, and Sequences", 2000.

[6] Heng Li. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM, 2013, arXiv:1303.3997.

[7] Jan Fostier, et al. Halvade: scalable sequence analysis with MapReduce, 2015, Bioinform.