Variable-order reference-free variant discovery with the Burrows-Wheeler Transform

Background: In [Prezza et al., AMB 2019], a new reference-free and alignment-free framework for the detection of SNPs was suggested and tested. The framework, based on the Burrows-Wheeler Transform (BWT), significantly improves on the sensitivity and precision of previous de Bruijn graph-based tools by overcoming several of their limitations, namely: (i) the need to fix a value, usually small, for the order k, (ii) the loss of important information such as k-mer coverage and adjacency of k-mers within the same read, and (iii) poor performance in repeated regions longer than k bases. The preliminary tool, however, could identify only SNPs, and it was too slow and memory-consuming because, besides the BWT, it used additional heavy data structures (namely, the Suffix and LCP arrays).

Results: In this paper, we introduce a new algorithm and the corresponding tool ebwt2InDel that (i) extends the framework of [Prezza et al., AMB 2019] to also detect INDELs, and (ii) implements recent algorithmic findings that make it possible to perform the whole analysis using just the BWT, thus reducing the working space by one order of magnitude and allowing the analysis of full genomes. Finally, we describe a simple strategy for effectively parallelizing our tool for SNP detection only. On a 24-core machine, the parallel version of our tool is one order of magnitude faster than the sequential one. The tool ebwt2InDel is available at github.com/nicolaprezza/ebwt2InDel.

Conclusions: Results on a synthetic dataset covered at 30x (Human chromosome 1) show that our tool is able to find up to 83% of the SNPs and 72% of the existing INDELs. These percentages considerably improve on the 71% of SNPs and 51% of INDELs found by the state-of-the-art tool based on de Bruijn graphs. We furthermore report results on larger (real) Human whole-genome sequencing experiments. In these cases too, our tool exhibits a much higher sensitivity than the state-of-the-art tool.
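The BWT-based idea summarized in the abstract can be illustrated with a toy sketch. It is not the ebwt2InDel algorithm: the function names `ebwt_with_contexts` and `candidate_snp_clusters`, the parameters `k` and `min_count`, and the example reads are all hypothetical. As a deliberate simplification, clusters are defined here by a fixed context length `k`, whereas the framework described above is variable-order. The sketch only shows why reads that share a right context end up adjacent in the transform, so that a SNP appears as a run of BWT symbols containing two well-supported letters.

```python
from itertools import groupby


def ebwt_with_contexts(reads):
    """Return (preceding_char, suffix) pairs in suffix-sorted order.

    Naive construction for illustration only: each read gets the end
    marker '$' and all suffixes of all reads are sorted explicitly.
    Real tools build the transform without materialising the suffixes.
    """
    rows = []
    for read in reads:
        text = read + "$"
        for i in range(len(text)):
            prev = text[i - 1] if i > 0 else "$"  # symbol preceding the suffix
            rows.append((text[i:], prev))
    rows.sort()
    return [(prev, suffix) for suffix, prev in rows]


def candidate_snp_clusters(reads, k, min_count=2):
    """Group suffixes by their first k characters (assumed fixed order k)
    and report contexts preceded by at least two letters, each seen at
    least min_count times."""
    rows = [
        (prev, suffix)
        for prev, suffix in ebwt_with_contexts(reads)
        if len(suffix) - 1 >= k and prev != "$"  # need k real symbols and a real predecessor
    ]
    clusters = []
    # rows are sorted by suffix, so equal k-prefixes are contiguous
    for context, group in groupby(rows, key=lambda row: row[1][:k]):
        counts = {}
        for prev, _ in group:
            counts[prev] = counts.get(prev, 0) + 1
        frequent = {c: n for c, n in counts.items() if n >= min_count}
        if len(frequent) >= 2:
            clusters.append((context, frequent))
    return clusters


# Hypothetical reads: two copies of each allele of an A/C SNP followed by "CAGG".
reads = ["TTACAGG", "TTACAGG", "TTCCAGG", "TTCCAGG"]
print(candidate_snp_clusters(reads, k=4))
```

In this toy input only the context `CAGG` is preceded by two different letters with enough support, which is the kind of signal a BWT-based caller would inspect further. Sequencing errors, reverse complements, and INDELs are ignored here.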

[1]  Marie-France Sagot,et al.  Identifying SNPs without a Reference Genome by Comparing Raw Reads , 2010, SPIRE.

[2]  Marie-France Sagot,et al.  Theme: Computational Biology and Bioinformatics Computational Sciences for Biology, Medicine and the Environment , 2012 .

[3]  Raffaele Giancarlo,et al.  A New Class of Searchable and Provably Highly Compressible String Transformations , 2019, CPM.

[4]  Antonio Restivo,et al.  The Alternating BWT: an algorithmic perspective , 2019, Theor. Comput. Sci..

[5]  Esko Ukkonen,et al.  Accurate self-correction of errors in long reads using de Bruijn graphs , 2016, Bioinform..

[6]  Xiangde Zhang,et al.  The Burrows-Wheeler similarity distribution between biological sequences based on Burrows-Wheeler transform. , 2010, Journal of theoretical biology.

[7]  G. McVean,et al.  De novo assembly and genotyping of variants using colored de Bruijn graphs , 2011, Nature Genetics.

[8]  Richard M Leggett,et al.  Reference-free SNP detection: dealing with the data deluge , 2014, BMC Genomics.

[9]  Antonio Restivo,et al.  An extension of the Burrows-Wheeler Transform , 2007, Theor. Comput. Sci..

[10]  Rayan Chikhi,et al.  Reference-free detection of isolated SNPs , 2014, Nucleic acids research.

[11]  Antonio Restivo,et al.  From first principles to the Burrows and Wheeler transform and beyond, via combinatorial optimization , 2007, Theor. Comput. Sci..

[12]  Giovanna Rosone,et al.  Comparing DNA Sequence Collections by Direct Comparison of Compressed Text Indexes , 2012, WABI.

[13]  Roberto Grossi,et al.  Efficient Bubble Enumeration in Directed Graphs , 2012, SPIRE.

[14]  Antonio Restivo,et al.  Burrows-Wheeler transform and Sturmian words , 2003, Inf. Process. Lett..

[15]  Jens Stoye,et al.  metaBEETL: high-throughput analysis of heterogeneous microbial populations from shotgun DNA sequences , 2013, BMC Bioinformatics.

[16]  Leena Salmela,et al.  LoRDEC: accurate and efficient long read error correction , 2014, Bioinform..

[17]  Gabor T. Marth,et al.  A global reference for human genetic variation , 2015, Nature.

[18]  Antonio Restivo,et al.  Measuring the clustering effect of BWT via RLE , 2017, Theor. Comput. Sci..

[19]  Pierre Peterlongo,et al.  DiscoSnp++: de novo detection of small variants from raw unassembled read set(s) , 2017, bioRxiv.

[20]  Pierre Peterlongo,et al.  Toward perfect reads: self-correction of short reads via mapping on de Bruijn graphs. , 2019, Bioinformatics.

[21]  Giovanna Rosone,et al.  Detecting Mutations by eBWT , 2018, WABI.

[22]  Hilde van der Togt,et al.  Publisher's Note , 2003, J. Netw. Comput. Appl..

[23]  Gonzalo Navarro,et al.  Fully Functional Suffix Trees and Optimal Text Searching in BWT-Runs Bounded Space , 2018, J. ACM.

[24]  Raffaele Giancarlo,et al.  Boosting textual compression in optimal linear time , 2005, JACM.

[25]  Paola Bonizzoni,et al.  On the Minimum Error Correction Problem for Haplotype Assembly in Diploid and Polyploid Genomes , 2016, J. Comput. Biol..

[26]  Antonio Restivo,et al.  Burrows-Wheeler Transform and Run-Length Enconding , 2017, WORDS.

[27]  Giovanna Rosone,et al.  The Burrows-Wheeler Transform between Data Compression and Combinatorics on Words , 2013, CiE.

[28]  Giovanna Rosone,et al.  Adaptive reference-free compression of sequence quality scores , 2014, Bioinform..

[29]  Travis Gagie,et al.  Wheeler graphs: A framework for BWT-based data structures , 2017, Theor. Comput. Sci..

[30]  Richard Durbin,et al.  Fast and accurate short read alignment with Burrows–Wheeler transform , 2009 .

[31]  Nuno A. Fonseca,et al.  Assemblathon 1: a competitive assessment of de novo short read assembly methods. , 2011, Genome research.

[32]  Giovanna Rosone,et al.  Lightweight algorithms for constructing and inverting the BWT of string collections , 2013, Theor. Comput. Sci..

[33]  Antonio Restivo,et al.  A New Combinatorial Approach to Sequence Comparison , 2005, Theory of Computing Systems.

[34]  Niko Välimäki,et al.  Scalable and Versatile k-mer Indexing for High-Throughput Sequencing Data , 2013, ISBRA.

[35]  Giovanna Rosone,et al.  Space-Efficient Computation of the LCP Array from the Burrows-Wheeler Transform , 2019, CPM.

[36]  Pierre Peterlongo,et al.  Mapping-Free and Assembly-Free Discovery of Inversion Breakpoints from Raw NGS Reads , 2014, AlCoB.

[37]  Antonio Restivo,et al.  Distance measures for biological sequences: Some recent approaches , 2008, Int. J. Approx. Reason..

[38]  Giovanna Rosone,et al.  Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform , 2012, Bioinform..

[39]  Antonio Restivo,et al.  Balancing and clustering of words in the Burrows-Wheeler transform , 2011, Theor. Comput. Sci..

[40]  Paola Bonizzoni,et al.  HapCol: accurate and memory-efficient haplotype assembly from long reads , 2016, Bioinform..

[41]  Tomasz Marek Kowalski,et al.  Indexing Arbitrary-Length k-Mers in Sequencing Reads , 2015, PloS one.

[42]  Shan-hui Hsu,et al.  Substrate-dependent gene regulation of self-assembled human MSC spheroids on chitosan membranes , 2013, BMC Genomics.

[43]  Tomasz Kociumaka,et al.  Resolution of the Burrows-Wheeler Transform Conjecture , 2019, 2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS).

[44]  Giovanna Rosone,et al.  Lightweight Metagenomic Classification via eBWT , 2019, AlCoB.

[45]  Leo van Iersel,et al.  WhatsHap: Weighted Haplotype Assembly for Future-Generation Sequencing Reads , 2015, J. Comput. Biol..

[46]  Richard Durbin,et al.  Fast and accurate long-read alignment with Burrows–Wheeler transform , 2010, Bioinform..

[47]  Giovanna Rosone,et al.  Lightweight LCP construction for very large collections of strings , 2016, J. Discrete Algorithms.

[48]  Amar Mukherjee,et al.  The Burrows-Wheeler Transform: Data Compression, Suffix Arrays, and Pattern Matching , 2008 .

[49]  Zsuzsanna Lipták,et al.  When a Dollar Makes a BWT , 2019, ICTCS.

[50]  Giovanna Rosone,et al.  SNPs detection by eBWT positional clustering , 2019, Algorithms for Molecular Biology.

[51]  Yingrui Li,et al.  SOAPindel: Efficient identification of indels from short paired reads , 2013, Genome research.

[52]  Asako Koike,et al.  Analysis of genomic rearrangements by using the Burrows-Wheeler transform of short-read data , 2015, BMC Bioinformatics.

[53]  Birgit Funke,et al.  Best practices for benchmarking germline small-variant calls in human genomes , 2019, Nature Biotechnology.

[54]  Tsachy Weissman,et al.  Compression of genomic sequencing reads via hash-based reordering: algorithm and analysis , 2018, Bioinform..

[55]  Gonzalo Navarro,et al.  Optimal-Time Text Indexing in BWT-runs Bounded Space , 2017, SODA.

[56]  Thierry Lecroq,et al.  Querying large read collections in main memory: a versatile data structure , 2011, BMC Bioinformatics.

[57]  Zamin Iqbal,et al.  Identifying and Classifying Trait Linked Polymorphisms in Non-Reference Species by Walking Coloured de Bruijn Graphs , 2013, PloS one.

[58]  Asako Koike,et al.  Ultrafast SNP analysis using the Burrows-Wheeler transform of short-read data , 2015, Bioinform..