deBGR: an efficient and near-exact representation of the weighted de Bruijn graph

Motivation: Almost all de novo short‐read genome and transcriptome assemblers start by building a representation of the de Bruijn graph of the reads they are given as input. Even when other approaches are used for subsequent assembly (e.g. when one is using ‘long read’ technologies like those offered by PacBio or Oxford Nanopore), efficient k‐mer processing is still crucial for accurate assembly, and state‐of‐the‐art long‐read error‐correction methods use de Bruijn graphs. Because of the centrality of de Bruijn graphs, researchers have proposed numerous methods for representing de Bruijn graphs compactly. Some of these proposals sacrifice accuracy to save space. Further, none of these methods store abundance information, i.e. the number of times that each k‐mer occurs, which is key in transcriptome assemblers.

Results: We present a method for compactly representing the weighted de Bruijn graph (i.e. with abundance information) with essentially no errors. Our representation yields zero errors while increasing the space requirements by less than 18‐28% compared to the approximate de Bruijn graph representation in Squeakr. Our technique is based on a simple invariant that all weighted de Bruijn graphs must satisfy, and hence is likely to be of general interest and applicable in most weighted de Bruijn graph‐based systems.

Availability and implementation: https://github.com/splatlab/debgr.

Contact: rob.patro@cs.stonybrook.edu

Supplementary information: Supplementary data are available at Bioinformatics online.
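To make the notion of a weighted de Bruijn graph in the Results concrete, here is a minimal Python sketch. It is not the deBGR implementation. It treats each k‐mer as an edge between its (k-1)‐mer prefix and suffix and stores the k‐mer's abundance as the edge weight. The abstract does not spell out the invariant it relies on, so the check below is an assumption for illustration: a flow‐balance property that holds for exact counts on a simplified, single‐strand, directed graph. The function names `weighted_de_bruijn` and `flow_is_balanced` are hypothetical.

```python
from collections import Counter


def weighted_de_bruijn(reads, k):
    """Count k-mer abundances (edge weights) and record read start/end nodes.

    Simplified model (an assumption of this sketch): single strand only,
    no canonical k-mers, nodes are (k-1)-mers.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    edges = Counter()   # k-mer -> abundance
    starts = Counter()  # (k-1)-mer -> number of reads that begin at this node
    ends = Counter()    # (k-1)-mer -> number of reads that end at this node
    for read in reads:
        if len(read) < k:
            continue  # too short to contribute any k-mer
        for i in range(len(read) - k + 1):
            edges[read[i:i + k]] += 1
        starts[read[:k - 1]] += 1
        ends[read[-(k - 1):]] += 1
    return edges, starts, ends


def flow_is_balanced(edges, starts, ends):
    """Check that, at every node, incoming weight plus read starts equals
    outgoing weight plus read ends."""
    incoming = Counter(starts)
    outgoing = Counter(ends)
    for kmer, count in edges.items():
        outgoing[kmer[:-1]] += count  # the edge leaves its prefix node
        incoming[kmer[1:]] += count   # the edge enters its suffix node
    nodes = set(incoming) | set(outgoing)
    return all(incoming[n] == outgoing[n] for n in nodes)


if __name__ == "__main__":
    edges, starts, ends = weighted_de_bruijn(["ACGTAC", "CGTACG"], k=3)
    print(flow_is_balanced(edges, starts, ends))  # exact counts balance
    edges["ACG"] += 1                             # simulate one overcount
    print(flow_is_balanced(edges, starts, ends))  # the imbalance is detected
```

In this simplified model, exact counts always balance because every read traces a path through the graph, so a count inflated by an approximate counter shows up as an imbalance at the affected nodes. How deBGR itself detects and corrects such errors is described in the paper, not here.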
