Fast, accurate, and lightweight analysis of BS-treated reads with ERNE 2

Background: Bisulfite treatment of DNA followed by sequencing (BS-seq) has become a standard technique in epigenetic studies, providing researchers with tools for generating single-base resolution maps of whole methylomes. Aligning bisulfite-treated reads, however, is a computationally difficult task: bisulfite treatment decreases the (lexical) complexity of low-methylated genomic regions, and C-to-T mismatches may reflect unmethylated cytosines rather than SNPs or sequencing errors. Further challenges arise both during and after the alignment phase: the data structures used by the aligner should be fast and should fit into main memory, and the methylation-caller output should be compressed because of its significant size.

Methods: The data structures proposed in the literature for aligning bisulfite-treated reads can be roughly grouped into two main categories: those storing pointers at each text position (e.g. hash tables, suffix trees/arrays), and those using the information-theoretic minimum number of bits (e.g. FM indexes and compressed suffix arrays). The former are fast but memory-consuming; the latter are lightweight but much slower. In this paper, we try to close this gap by proposing a data structure for aligning bisulfite-treated reads that is at the same time fast, light, and very accurate. We reach this objective by combining a recent theoretical result on succinct hashing with a bisulfite-aware hash function.
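The idea of a bisulfite-aware hash function can be illustrated with a minimal sketch. This is not ERNE's actual hash function or its succinct index: the names `BS_CODE`, `bs_hash`, `build_index`, and `candidate_positions` are hypothetical, the index is a plain dictionary, and only forward-strand C-to-T conversion is modelled. The sketch assumes that C and T are mapped to the same symbol, so that an unmethylated cytosine read as T cannot change the hash of a k-mer.

```python
from collections import defaultdict

# Hypothetical reduced alphabet: C and T share one code, so a C-to-T
# conversion introduced by bisulfite treatment leaves the hash unchanged.
BS_CODE = {"A": 0, "C": 1, "G": 2, "T": 1}


def bs_hash(kmer: str) -> int:
    """Hash a k-mer as a base-3 number over the reduced alphabet {A, C/T, G}."""
    h = 0
    for base in kmer.upper():
        h = h * 3 + BS_CODE[base]
    return h


def build_index(reference: str, k: int) -> dict:
    """Map each k-mer hash to the list of reference positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[bs_hash(reference[i:i + k])].append(i)
    return index


def candidate_positions(index: dict, read: str, k: int) -> list:
    """Return reference positions whose k-mer hash matches the read's first k bases."""
    return index.get(bs_hash(read[:k]), [])
```

Because the hash also collides when the reference has T and the read has C (which bisulfite treatment cannot produce), candidate positions returned this way would still need to be verified base by base against the original reference.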
Furthermore, the new versions of the tools implementing our ideas (the aligner ERNE-BS5 2 and the caller ERNE-METH 2) have been extended with increased downstream compatibility (EPP/Bismark cov output formats), output compression, and support for target enrichment protocols.

Results: Experimental results on public and simulated WGBS libraries show that our algorithmic solution is a competitive tradeoff between hash-based and BWT-based indexes, being as fast and accurate as the former, and as memory-efficient as the latter.

Conclusions: The new functionalities of our bisulfite aligner and caller make them fast and memory-efficient tools, useful for analyzing big datasets with limited computational resources, for easily processing target enrichment data, and for producing statistics such as protocol efficiency and coverage as a function of the distance from target regions.
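For readers unfamiliar with the Bismark cov format mentioned among the output formats, the following sketch shows how one record might be produced. It assumes the tab-separated layout documented for Bismark coverage files (chromosome, start, end, methylation percentage, count methylated, count unmethylated, with 1-based coordinates). The function name `cov_line` and the precision used for the percentage are illustrative assumptions, not taken from ERNE-METH 2.

```python
def cov_line(chrom: str, pos: int, methylated: int, unmethylated: int) -> str:
    """Format one cytosine as a Bismark-style coverage record (hypothetical helper)."""
    total = methylated + unmethylated
    if total == 0:
        raise ValueError("position has no coverage")
    # Methylation level as a percentage of the reads covering this cytosine.
    pct = 100.0 * methylated / total
    # Start and end coincide because each record describes a single base.
    return f"{chrom}\t{pos}\t{pos}\t{pct:.6g}\t{methylated}\t{unmethylated}"
```

Positions with zero coverage are rejected rather than reported, since the methylation percentage is undefined there.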

[27]  Michael Q. Zhang,et al.  BS-Seeker2: a versatile aligning pipeline for bisulfite sequencing data , 2013, BMC Genomics.