Representation of k-mer sets using spectrum-preserving string sets

Given the popularity and elegance of k-mer-based tools, finding a space-efficient way to represent a set of k-mers is important for improving the scalability of bioinformatics analyses. One popular approach is to convert the set of k-mers into the more compact set of unitigs. We generalize this approach and formulate it as the problem of finding a smallest spectrum-preserving string set (SPSS) representation. We show that this problem is equivalent to finding a smallest path cover in a compacted de Bruijn graph. Using this reduction, we prove a lower bound on the size of the optimal SPSS and propose a greedy method called UST that results in a smaller representation than unitigs and is nearly optimal with respect to our lower bound. We demonstrate the usefulness of the SPSS formulation with two applications of UST. The first is a compression algorithm, UST-Compress, which we show can store a set of k-mers using an order of magnitude less disk space than other lossless compression tools. The second is an exact static k-mer membership index, UST-FM, which we show improves index size by 10-44% compared to other state-of-the-art low-memory indices. Our tool is publicly available at https://github.com/medvedevgroup/UST/.
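
To make the idea of a spectrum-preserving string set concrete, the following is a minimal Python sketch of a greedy gluing strategy. It is not the UST implementation: it works directly on k-mers rather than on a compacted de Bruijn graph, it ignores reverse complements, and the names `spectrum` and `greedy_spss` are hypothetical helpers introduced only for illustration. It shows the invariant that matters, namely that the output strings contain exactly the input k-mers, each once, while using fewer characters than listing the k-mers separately.

```python
def spectrum(strings, k):
    """Return the set of all k-mers occurring in the given strings."""
    return {s[i:i + k] for s in strings for i in range(len(s) - k + 1)}


def greedy_spss(kmers, k, alphabet="ACGT"):
    """Greedily glue k-mers into longer strings (illustrative sketch only).

    Assumptions: every element of `kmers` has length k, and reverse
    complements are NOT treated as equivalent (a simplification).
    """
    unused = set(kmers)
    result = []
    while unused:
        path = min(unused)  # deterministic seed choice, arbitrary otherwise
        unused.remove(path)

        # Extend to the right while some unused k-mer overlaps by k-1 characters.
        extended = True
        while extended:
            extended = False
            suffix = path[len(path) - (k - 1):]
            for c in alphabet:
                if suffix + c in unused:
                    unused.remove(suffix + c)
                    path += c
                    extended = True
                    break

        # Extend to the left in the same way.
        extended = True
        while extended:
            extended = False
            prefix = path[:k - 1]
            for c in alphabet:
                if c + prefix in unused:
                    unused.remove(c + prefix)
                    path = c + path
                    extended = True
                    break

        result.append(path)
    return result


if __name__ == "__main__":
    k = 3
    kmers = spectrum(["ACGTACGGA"], k)
    strings = greedy_spss(kmers, k)
    # The spectrum is preserved and no k-mer is stored twice.
    assert spectrum(strings, k) == kmers
    assert sum(len(s) - k + 1 for s in strings) == len(kmers)
    print(strings, sum(len(s) for s in strings), "characters vs", k * len(kmers))
```

Each k-mer is removed from `unused` the moment it is glued onto a path, so the resulting strings correspond to vertex-disjoint paths that cover the node-centric de Bruijn graph of the input. That is the same kind of object as the path cover described in the abstract, although no claim is made here about how close this simplified greedy gets to the lower bound.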
