SNPs detection by eBWT positional clustering

Background: Sequencing technologies keep getting cheaper and faster, which increases the pressure for data structures that can efficiently store raw data and, possibly, support analysis directly on it. In this view, there is growing interest in alignment-free and reference-free variant calling methods that use only (suitably indexed) raw read data.

Results: We develop the positional clustering theory, which (i) describes how the extended Burrows–Wheeler Transform (eBWT) of a collection of reads tends to cluster together bases that cover the same genome position, (ii) predicts the size of such clusters, and (iii) provides an elegant and precise LCP-array-based procedure to locate such clusters in the eBWT. Based on this theory, we designed and implemented an alignment-free and reference-free SNP calling method, and we devised a corresponding SNP calling pipeline. Experiments on both synthetic and real data show that SNPs can be detected with a simple scan of the eBWT and LCP arrays because, in accordance with our theoretical framework, they lie within clusters in the eBWT of the reads. Finally, our tool intrinsically performs a reference-free evaluation of its accuracy by returning the coverage of each SNP.

Conclusions: Based on the results of the experiments on synthetic and real data, we conclude that the positional clustering framework can be effectively used for the problem of identifying SNPs, and it appears to be a promising approach for calling other types of variants directly on raw sequencing data.

Availability: The software ebwt2snp is freely available for academic use at: https://github.com/nicolaprezza/ebwt2snp.
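To make the idea of scanning the eBWT and LCP arrays concrete, here is a minimal Python sketch. It is not the ebwt2snp implementation and rests on several simplifying assumptions: the arrays are built naively by sorting all suffixes (real tools use lightweight or external-memory construction), clusters are taken to be maximal runs whose LCP values are at least a fixed threshold `k` (a stand-in for the paper's more precise LCP-based procedure), and a cluster is flagged when at least two different bases each occur `min_cov` times or more. The function names `ebwt_and_lcp` and `candidate_clusters` and the parameters `k` and `min_cov` are hypothetical.

```python
from collections import Counter


def ebwt_and_lcp(reads):
    """Naively build the eBWT and LCP arrays of a read collection.

    Each read gets an end marker "$"; equal suffixes are ordered by read index.
    """
    suffixes = []
    for r, read in enumerate(reads):
        text = read + "$"
        for i in range(len(text)):
            # text[i - 1] is the character preceding the suffix
            # (for i == 0 this wraps around to the end marker "$").
            suffixes.append((text[i:], r, text[i - 1]))
    suffixes.sort(key=lambda s: (s[0], s[1]))

    bwt = [s[2] for s in suffixes]
    lcp = [0] * len(suffixes)
    for j in range(1, len(suffixes)):
        a, b = suffixes[j - 1][0], suffixes[j][0]
        n = 0
        # End markers are treated as distinct, so they never extend a match.
        while n < min(len(a), len(b)) and a[n] == b[n] and a[n] != "$":
            n += 1
        lcp[j] = n
    return bwt, lcp


def candidate_clusters(bwt, lcp, k, min_cov):
    """Return (start, end, bases) for runs with LCP >= k holding >= 2 frequent bases."""
    found = []
    start = 0
    n = len(bwt)
    for j in range(1, n + 1):
        # A run [start, j) ends where the shared context drops below k.
        if j == n or lcp[j] < k:
            counts = Counter(c for c in bwt[start:j] if c != "$")
            frequent = sorted(c for c, cnt in counts.items() if cnt >= min_cov)
            if len(frequent) >= 2:
                found.append((start, j, frequent))
            start = j
    return found


# Toy input: two copies of each allele, differing in one base (T vs C).
reads = ["GATTACAGG"] * 2 + ["GATCACAGG"] * 2
bwt, lcp = ebwt_and_lcp(reads)
for start, end, bases in candidate_clusters(bwt, lcp, k=4, min_cov=2):
    print(start, end, bases)
```

In this toy input, the suffixes that begin with the shared right context `ACAGG` become adjacent in sorted order, so the bases that precede them (the two alleles) fall into one run of the eBWT. The number of occurrences of each base in the run plays the role of the per-SNP coverage mentioned in the abstract.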
