CloudGT: A High Performance Genome Analysis Toolkit Leveraging Pipeline Optimization on Spark

The rapid development of next-generation sequencing (NGS) has led to explosive growth in genome data. This volume of data poses a great challenge to post-sequencing genomic analysis and places higher demands on the performance and pipeline optimization of genome analysis tools. In this paper, we present CloudGT, a high-performance genome analysis toolkit built on Apache Spark that leverages pipeline optimization. CloudGT accelerates genome analysis by using the Spark framework to parallelize work across multiple cores and multiple nodes. To further improve performance, we present a pipeline optimization strategy based on Parquet storage that reduces storage space and I/O overhead throughout the process. Three main tools, CloudBWAMem, CloudDuplicateMark and CloudHaplotypeCaller, which support read alignment, duplicate marking and variant calling respectively, are implemented and optimized to fit the optimized pipeline and further improve overall performance. In our whole-genome sequencing experiments, CloudGT is evaluated on a 10-node cluster and achieves high performance and scalability. Compared with the unoptimized pipeline, the optimized pipeline reduces total storage space by 80.95% and runtime by 14.32%; it is also 26.6%-74.9% faster than state-of-the-art cluster-based toolkits and approaches, with higher accuracy.
