Disk compression of k-mer sets

K-mer-based methods have become prevalent in many areas of bioinformatics. In applications such as database search, they often operate on multi-terabyte datasets. Storing datasets of this size is a burden on tool developers and tool users, and it hinders reproducibility efforts. General-purpose compressors such as gzip, and compressors designed for read data, are sub-optimal because they do not account for the specific redundancy pattern in k-mer sets. In our earlier work (Rahman and Medvedev, RECOMB 2020), we presented an algorithm, UST-Compress, that uses a spectrum-preserving string set representation to compress a set of k-mers to disk. In this paper, we present two improved methods for disk compression of k-mer sets, called ESS-Compress and ESS-Tip-Compress. They use a more relaxed notion of string set representation to remove redundancy that remains in the representation used by UST-Compress. We explore their behavior both theoretically and on real data. We show that they improve on the compression sizes achieved by UST-Compress by up to 27 percent across a breadth of datasets. We also derive lower bounds on how well this type of compression strategy can hope to do.
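
The redundancy that a spectrum-preserving string set exploits can be illustrated with a small sketch. The code below is not UST-Compress, ESS-Compress, or ESS-Tip-Compress; it is a toy greedy merge, assuming forward-strand k-mers only (no reverse complements), and the names `spectrum` and `greedy_spss` are hypothetical. It shows only that k-mers overlapping by k-1 characters can be stored as fewer, longer strings whose k-mer spectrum equals the original set.

```python
def spectrum(strings, k):
    """Return the set of all k-mers occurring in the given strings."""
    return {s[i:i + k] for s in strings for i in range(len(s) - k + 1)}


def greedy_spss(kmers):
    """Greedily merge k-mers that overlap by k-1 characters.

    Assumption for illustration: forward strand only. Each input k-mer
    ends up in exactly one output string, exactly once.
    """
    remaining = set(kmers)
    strings = []
    while remaining:
        s = min(remaining)  # deterministic choice of seed k-mer
        remaining.remove(s)
        k = len(s)
        # Extend to the right while an unused successor k-mer exists.
        extended = True
        while extended:
            extended = False
            for c in "ACGT":
                nxt = s[len(s) - k + 1:] + c
                if nxt in remaining:
                    remaining.remove(nxt)
                    s += c
                    extended = True
                    break
        # Extend to the left while an unused predecessor k-mer exists.
        extended = True
        while extended:
            extended = False
            for c in "ACGT":
                prv = c + s[:k - 1]
                if prv in remaining:
                    remaining.remove(prv)
                    s = c + s
                    extended = True
                    break
        strings.append(s)
    return strings


k = 4
kmers = spectrum(["ACGTACGGTCA"], k)   # toy input: 8 distinct 4-mers
strings = greedy_spss(kmers)

naive_chars = len(kmers) * k           # one k-mer per record
spss_chars = sum(len(s) for s in strings)
print(f"{len(kmers)} k-mers: {naive_chars} chars naive, "
      f"{spss_chars} chars in {len(strings)} string(s)")
```

Each output string of length n contributes n - k + 1 k-mers, so the fewer strings are needed, the closer the total size gets to one character per k-mer. A general-purpose compressor applied to a one-k-mer-per-line file has no direct way to exploit this overlap structure.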
