PipeMEM: A Framework to Speed Up BWA-MEM in Spark with Low Overhead

(1) Background: DNA sequence alignment is an essential step in genome analysis. BWA-MEM is a prevalent single-node alignment tool because of its high speed and accuracy. However, genome data are being generated at an exponentially growing rate, and handling such volumes requires a multi-node solution, which remains a challenge. Spark is a widely used big data platform that has been exploited to help genome alignment meet this challenge. Nonetheless, existing works that use Spark to scale BWA-MEM suffer from high overhead. (2) Methods: In this paper, we present PipeMEM, a framework that accelerates BWA-MEM with lower overhead by means of the pipe operation in Spark. We additionally propose a pipeline structure and in-memory computation to accelerate PipeMEM. (3) Results: Our experiments showed that, on paired-end alignment tasks, our framework had low overhead. In a multi-node environment, our framework was, on average, 2.27× faster than BWASpark (an alignment tool in the Genome Analysis Toolkit (GATK)) and 2.33× faster than SparkBWA. (4) Conclusions: PipeMEM can accelerate BWA-MEM in the Spark environment with high performance and low overhead.
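
The pipe operation mentioned in the Methods can be illustrated with a minimal sketch. In Spark, `RDD.pipe(command)` streams the records of each partition to an external process over standard input and collects the lines that process writes to standard output. The Python code below imitates that behaviour for a list of in-memory partitions using only the standard library. It is not the PipeMEM implementation: the function names `pipe_partition` and `pipe_all` are hypothetical, and `STAND_IN_ALIGNER` is an assumed placeholder command (it merely upper-cases each line) standing in for an external aligner such as BWA-MEM.

```python
import subprocess
import sys


def pipe_partition(records, command):
    """Send one partition's records to an external command, one per line,
    and return the lines the command writes to stdout."""
    payload = "".join(record + "\n" for record in records)
    result = subprocess.run(
        command,
        input=payload,
        capture_output=True,
        text=True,
        check=True,  # raise if the external command fails
    )
    return result.stdout.splitlines()


def pipe_all(partitions, command):
    """Apply the same external command to every partition independently,
    mirroring how a pipe operation launches one process per partition."""
    return [pipe_partition(partition, command) for partition in partitions]


# Assumption: a placeholder for the aligner. It upper-cases each input line
# so the sketch runs without BWA-MEM or a reference index installed.
STAND_IN_ALIGNER = [
    sys.executable,
    "-c",
    "import sys\nfor line in sys.stdin:\n    sys.stdout.write(line.upper())",
]

# Two hypothetical partitions of read sequences.
partitions = [["acgt", "ttga"], ["ggcc"]]
aligned = pipe_all(partitions, STAND_IN_ALIGNER)
```

Because the external tool is used unmodified and communicates only through its standard streams, this style of integration needs no changes to the aligner's source code, which is one plausible reason a pipe-based design can keep overhead low.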
