Bloom Filter Trie: an alignment-free and reference-free data structure for pan-genome storage

Background: High-throughput sequencing technologies have become fast and cheap in recent years. As a result, large-scale projects have started to sequence tens to several thousands of genomes per species, producing a large number of sequences sampled from each genome. Such a highly redundant collection of very similar sequences is called a pan-genome. It can be transformed into a set of sequences “colored” by the genomes to which they belong. A colored de Bruijn graph (C-DBG) extracts all colored k-mers (strings of length k) from the sequences and stores them in vertices.

Results: In this paper, we present an alignment-free, reference-free and incremental data structure for storing a pan-genome as a C-DBG: the Bloom Filter Trie (BFT). The data structure stores and compresses a set of colored k-mers and supports efficient traversal of the graph. BFT was used to index and query different pan-genome datasets. Compared to another state-of-the-art data structure, BFT was up to two times faster to build while using about the same amount of main memory. For querying k-mers, BFT was about 52–66 times faster while using about 5.5–14.3 times less memory.

Conclusion: We present a novel succinct data structure called the Bloom Filter Trie for indexing a pan-genome as a colored de Bruijn graph. The trie stores k-mers and their colors using a new representation of vertices that compresses and indexes shared substrings. Vertices use basic data structures for lightweight substring storage, as well as Bloom filters for efficient trie and graph traversals. Experimental results show better performance than another state-of-the-art data structure.

Availability: https://www.github.com/GuillaumeHolley/BloomFilterTrie.
