Bifrost: highly parallel construction and indexing of colored and compacted de Bruijn graphs

Motivation: De Bruijn graphs are the core data structure for a wide range of assemblers and genome analysis software processing High Throughput Sequencing datasets. For population genomic analysis, the colored de Bruijn graph is often used to take advantage of the massive sets of sequenced genomes available for each species. However, the memory consumption of tools based on the de Bruijn graph is often prohibitive because of the high number of vertices, edges or colors in the graph. To process large and complex genomes, most short-read assemblers based on the de Bruijn graph paradigm reduce assembly complexity and memory usage by first compacting all maximal non-branching paths of the graph into single vertices. Yet de Bruijn graph compaction is challenging, as it requires the uncompacted de Bruijn graph to be available in memory.

Results: We present a new parallel and memory-efficient algorithm enabling the direct construction of the compacted de Bruijn graph without producing the intermediate uncompacted de Bruijn graph. Bifrost features a broad range of functions, such as sequence querying, storage of user data alongside vertices, and graph editing that automatically preserves the compaction property. Bifrost makes full use of the efficiency of its dynamic index and provides a graph coloring method that efficiently maps each k-mer of the graph to the set of genomes in which it occurs. Experimental results show that our algorithm is competitive with state-of-the-art de Bruijn graph compaction and coloring tools. Bifrost was able to build the colored and compacted de Bruijn graph of about 118,000 Salmonella genomes on a mid-class server in about 4 days using 103 GB of main memory.

Availability: https://github.com/pmelsted/bifrost, available under a BSD-2 license.

Contact: guillaumeholley@gmail.com
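To make the terms "compaction" and "coloring" concrete, the following Python sketch builds a tiny colored de Bruijn graph and merges its maximal non-branching paths into unitigs. It is an illustration only, not Bifrost's algorithm or API: it deliberately takes the naive route of materializing the full uncompacted k-mer set first (the step Bifrost is designed to avoid), it considers the forward strand only and ignores reverse complements, and all function names (`build_colored_kmers`, `compact`, and so on) are hypothetical.

```python
from collections import defaultdict

ALPHABET = "ACGT"

def kmers(seq, k):
    """Return all overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def build_colored_kmers(genomes, k):
    """Map each k-mer to the set of genome names (colors) containing it."""
    colors = defaultdict(set)
    for name, seq in genomes.items():
        for km in kmers(seq, k):
            colors[km].add(name)
    return colors

def successors(km, kmer_set):
    """k-mers reachable by shifting km one base to the right."""
    return [km[1:] + c for c in ALPHABET if km[1:] + c in kmer_set]

def predecessors(km, kmer_set):
    """k-mers from which km is reachable by a one-base shift."""
    return [c + km[:-1] for c in ALPHABET if c + km[:-1] in kmer_set]

def compact(kmer_set):
    """Merge maximal non-branching paths into unitig strings (naive)."""
    visited = set()
    unitigs = []
    for start in sorted(kmer_set):
        if start in visited:
            continue
        path = [start]
        visited.add(start)
        # Extend forward while the edge cur -> nxt is non-branching.
        cur = start
        while True:
            succ = successors(cur, kmer_set)
            if len(succ) != 1:
                break
            nxt = succ[0]
            if nxt in visited or len(predecessors(nxt, kmer_set)) != 1:
                break
            path.append(nxt)
            visited.add(nxt)
            cur = nxt
        # Extend backward under the same non-branching condition.
        cur = start
        while True:
            pred = predecessors(cur, kmer_set)
            if len(pred) != 1:
                break
            prv = pred[0]
            if prv in visited or len(successors(prv, kmer_set)) != 1:
                break
            path.insert(0, prv)
            visited.add(prv)
            cur = prv
        # Spell the unitig: first k-mer plus the last base of each later one.
        unitigs.append(path[0] + "".join(p[-1] for p in path[1:]))
    return sorted(unitigs)

# Two toy "genomes" that differ by a single base.
genomes = {"g1": "ACGTTGCA", "g2": "ACGTAGCA"}
colors = build_colored_kmers(genomes, k=3)
unitigs = compact(set(colors))
```

On this toy input, the nine distinct 3-mers collapse into four unitigs: the shared prefix, one unitig per variant branch, and the shared suffix. The color sets show which genome each branch belongs to. The memory cost of holding every k-mer in `colors` before compaction is what becomes prohibitive at the scale of thousands of genomes.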
