BiSpark: a Spark-based highly scalable aligner for bisulfite sequencing data

Background: Bisulfite sequencing is one of the major high-resolution methods for measuring DNA methylation. Because sodium bisulfite treatment selectively converts unmethylated cytosines, processing bisulfite-treated sequencing reads requires additional steps that are computationally demanding. However, the lack of efficient aligners designed for bisulfite-treated sequencing has become a bottleneck for large-scale DNA methylome analyses.

Results: In this study, we present BiSpark, a highly scalable, efficient, and load-balanced bisulfite aligner designed for processing large volumes of bisulfite sequencing data. We implemented the BiSpark algorithm on Apache Spark, a memory-optimized distributed data processing framework, to maximize data-parallel efficiency. The BiSpark algorithm is designed to support redistribution of imbalanced data to minimize delays in large-scale distributed environments.

Conclusions: Experimental results on methylome datasets show that BiSpark significantly outperforms other state-of-the-art bisulfite sequencing aligners in alignment speed and in scalability with respect to dataset size and the number of computing nodes, while providing highly consistent and comparable mapping results.

Availability: The BiSpark software package and source code are available at https://github.com/bhi-kimlab/BiSpark/.
