Efficient Construction of a Compressed de Bruijn Graph for Pan-Genome Analysis

Recently, Marcus et al. (Bioinformatics 2014) proposed to use a compressed de Bruijn graph of maximal exact matches to describe the relationship between the genomes of many individuals/strains of the same or closely related species. They devised an \(O(n\log g)\) time algorithm called splitMEM that constructs this graph directly (i.e., without using the uncompressed de Bruijn graph) based on a suffix tree, where \(n\) is the total length of the genomes and \(g\) is the length of the longest genome. In this paper, we present an algorithm that outperforms their algorithm in theory and in practice. More precisely, our algorithm has a better worst-case time complexity of \(O(n\log \sigma )\), where \(\sigma \) is the size of the alphabet (\(\sigma = 4\) for DNA). Moreover, experiments show that it is much faster than splitMEM while using only a fraction of the space required by splitMEM.
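To make the object under discussion concrete, the following is a minimal sketch of what a compressed de Bruijn graph looks like for a tiny input. It is deliberately naive: it builds the uncompressed \(k\)-mer graph first and then merges non-branching chains, which is exactly the detour that both splitMEM and the algorithm of this paper avoid. The function name `compressed_de_bruijn` and the parameter `k` are illustrative assumptions, and the sketch ignores details such as special handling of sequence boundaries.

```python
from collections import defaultdict


def compressed_de_bruijn(genomes, k):
    """Naive illustration only (not splitMEM, not the paper's algorithm).

    Builds the ordinary k-mer graph, then merges every chain of k-mers
    in which each edge is the only outgoing edge of its source and the
    only incoming edge of its target. Returns (node_labels, edges).
    """
    succ = defaultdict(set)   # k-mer -> set of following k-mers
    pred = defaultdict(set)   # k-mer -> set of preceding k-mers
    kmers = set()
    for g in genomes:
        for i in range(len(g) - k + 1):
            kmers.add(g[i:i + k])
        for i in range(len(g) - k):
            u, v = g[i:i + k], g[i + 1:i + k + 1]
            succ[u].add(v)
            pred[v].add(u)

    seen = set()

    def walk(start):
        # Extend the chain to the right while the next edge is non-branching.
        label, cur = start, start
        seen.add(start)
        while len(succ[cur]) == 1:
            (nxt,) = succ[cur]
            if len(pred[nxt]) != 1 or nxt in seen:
                break
            label += nxt[-1]
            seen.add(nxt)
            cur = nxt
        return label

    def is_chain_start(u):
        # u starts a chain unless it would be absorbed by its unique predecessor.
        if len(pred[u]) != 1:
            return True
        (p,) = pred[u]
        return len(succ[p]) != 1 or p == u

    nodes = [walk(u) for u in sorted(kmers) if is_chain_start(u)]
    # Remaining k-mers lie on isolated cycles without a natural start.
    for u in sorted(kmers):
        if u not in seen:
            nodes.append(walk(u))

    # Connect compressed nodes: last k-mer of one node to first k-mer of another.
    by_first = {label[:k]: label for label in nodes}
    edges = sorted(
        (label, by_first[v]) for label in nodes for v in succ[label[-k:]]
    )
    return sorted(nodes), edges
```

For the two toy genomes `"ACGTTGCA"` and `"ACGATGCA"` with `k = 3`, this sketch yields the four nodes `ACG`, `CGATG`, `CGTTG`, and `TGCA`: the shared prefix and suffix each become a single node, and the one differing base produces a two-branch bubble between them. The cost of materializing every \(k\)-mer is what makes this approach impractical for real pan-genomes.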
