Space efficient merging of de Bruijn graphs and Wheeler graphs

The merging of succinct data structures is a well-established technique for the space-efficient construction of large succinct indexes. In the first part of the paper we propose a new algorithm for merging succinct representations of de Bruijn graphs. Our algorithm has the same asymptotic cost as the state-of-the-art algorithm for the same problem, but it uses less than half of its working space. A novel and important feature of our algorithm, not found in any of the existing tools, is that it can compute the variable-order succinct representation of the union graph within the same asymptotic time and space bounds. In the second part of the paper we consider the more general problem of merging succinct representations of Wheeler graphs, a recently introduced graph family that includes as special cases de Bruijn graphs and many other known succinct indexes based on the BWT or one of its variants. We show that merging Wheeler graphs is in general a much more difficult problem, and we provide a space-efficient algorithm for the slightly simplified problem of determining whether the union graph has an ordering that satisfies the Wheeler conditions.
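For readers unfamiliar with the Wheeler conditions, the sketch below checks whether a given node ordering satisfies them for an edge-labeled directed graph. It uses the standard definition: nodes with no incoming edges come first, and for any two edges (u, v) labeled a and (u', v') labeled a', a < a' implies v < v', while a = a' and u < u' implies v ≤ v'. It is a naive quadratic check of one candidate ordering, not the space-efficient algorithm of the paper, and it does not search for an ordering. The function name `is_wheeler_order` and the edge-list representation are assumptions made for illustration.

```python
from itertools import product


def is_wheeler_order(edges, order):
    """Return True if `order` satisfies the Wheeler conditions.

    edges: list of (source, target, label) triples (hypothetical format).
    order: list of all nodes, in the candidate order.
    """
    rank = {node: i for i, node in enumerate(order)}
    has_incoming = {v for _, v, _ in edges}

    # Nodes with in-degree zero must precede all nodes with incoming edges.
    seen_incoming = False
    for node in order:
        if node in has_incoming:
            seen_incoming = True
        elif seen_incoming:
            return False

    # Compare every ordered pair of edges (naive O(|E|^2) check).
    for (u1, v1, a1), (u2, v2, a2) in product(edges, repeat=2):
        # Smaller label implies strictly smaller target.
        if a1 < a2 and not rank[v1] < rank[v2]:
            return False
        # Equal labels: smaller source implies target not larger.
        if a1 == a2 and rank[u1] < rank[u2] and not rank[v1] <= rank[v2]:
            return False
    return True


# A three-node path 0 -a-> 1 -b-> 2, used only as a toy example.
path_edges = [(0, 1, "a"), (1, 2, "b")]
print(is_wheeler_order(path_edges, [0, 1, 2]))
print(is_wheeler_order(path_edges, [0, 2, 1]))
```

On this toy path the first ordering passes, while the second fails because the target of the edge labeled "a" is placed after the target of the edge labeled "b".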
