Disk compression of k-mer sets

K-mer based methods have become prevalent in many areas of bioinformatics. In applications such as database search, they often work with large multi-terabyte-sized datasets. Storing such large datasets is a detriment to tool developers, tool users, and reproducibility efforts. General purpose compressors like gzip, or those designed for read data, are sub-optimal because they do not take into account the specific redundancy pattern in k-mer sets. In our earlier work (Rahman and Medvedev, RECOMB 2020), we presented an algorithm UST-Compress that uses a spectrum-preserving string set representation to compress a set of k-mers to disk. In this paper, we present two improved methods for disk compression of k-mer sets, called ESS-Compress and ESS-Tip-Compress. They use a more relaxed notion of string set representation to further remove redundancy from the representation of UST-Compress. We explore their behavior both theoretically and on real data. We show that they improve the compression sizes achieved by UST-Compress by up to 27 percent, across a breadth of datasets. We also derive lower bounds on how well this type of compression strategy can hope to do.
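To make the idea of a spectrum-preserving string set concrete, here is a minimal Python sketch. It is an illustration only, not the UST-Compress or ESS-Compress algorithm: it assumes forward-strand k-mers only (no reverse complements), and the helper names `kmers`, `spectrum`, and `greedy_spss` are hypothetical. It greedily glues k-mers that overlap by k-1 characters into longer strings, so that each k-mer of the input set appears in the output, which takes fewer characters than listing the k-mers one by one.

```python
def kmers(s, k):
    """Return every k-long substring of s, left to right."""
    return [s[i:i + k] for i in range(len(s) - k + 1)]


def spectrum(strings, k):
    """Return the set of all k-mers occurring in a collection of strings."""
    out = set()
    for s in strings:
        out.update(kmers(s, k))
    return out


def greedy_spss(kmer_set, k):
    """Greedily merge k-mers with (k-1)-overlaps into longer strings.

    Assumption: forward strand only; k must be at least 2.
    """
    assert k >= 2
    unused = set(kmer_set)
    strings = []
    while unused:
        s = min(unused)  # deterministic choice of a seed k-mer
        unused.remove(s)
        # Extend to the right while some unused k-mer continues the string.
        extended = True
        while extended:
            extended = False
            for c in "ACGT":
                nxt = s[-(k - 1):] + c
                if nxt in unused:
                    unused.remove(nxt)
                    s += c
                    extended = True
                    break
        # Extend to the left in the same way.
        extended = True
        while extended:
            extended = False
            for c in "ACGT":
                prv = c + s[:k - 1]
                if prv in unused:
                    unused.remove(prv)
                    s = c + s
                    extended = True
                    break
        strings.append(s)
    return strings


kmer_set = {"ACG", "CGT", "GTA", "TTT"}
strings = greedy_spss(kmer_set, 3)
total_chars = sum(len(s) for s in strings)
```

In this toy input the four 3-mers (12 characters) are represented by two strings whose k-mer spectrum equals the original set. The methods described in the abstract go further, for example by relaxing the notion of string set representation, which this sketch does not attempt.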

Amatur Rahman | Paul Medvedev | Rayan Chikhi

[1] Daniel S. Standage, et al. Kevlar: a mapping-free framework for accurate discovery of de novo variants, 2019.

[2] Gregory Gutin, et al. Digraphs - theory, algorithms and applications, 2002.

[3] Chen Sun, et al. AllSome Sequence Bloom Trees, 2018, J. Comput. Biol.

[4] Süleyman Cenk Sahinalp, et al. Genomic Data Compression, 2019, Encyclopedia of Big Data Technologies.

[5] Derrick E. Wood, et al. Kraken: ultrafast metagenomic sequence classification using exact alignments, 2014, Genome Biology.

[6] Carl Kingsford, et al. Fast Search of Thousands of Short-Read Sequencing Experiments, 2015, Nature Biotechnology.

[7] Paola Bonizzoni, et al. MALVA: Genotyping by Mapping-free ALlele Detection of Known VAriants, 2019, bioRxiv.

[8] Tadashi Imanishi, et al. Nucleotide Archival Format (NAF) enables efficient lossless reference-free compression of DNA sequences, 2018, bioRxiv.

[9] Sergey I. Nikolenko, et al. SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing, 2012, J. Comput. Biol.

[10] Brian D. Ondov, et al. Mash: fast genome and metagenome distance estimation using MinHash, 2015, Genome Biology.

[11] Armando J. Pinho, et al. A Survey on Data Compression Methods for Biological Sequences, 2016, Inf.

[12] Faraz Hach, et al. Comparison of high-throughput sequencing data compression tools, 2016, Nature Methods.

[13] Costas S. Iliopoulos, et al. Efficient Pattern Matching in Elastic-Degenerate Texts, 2020, LATA.

[14] Carl Kingsford, et al. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers, 2011, Bioinform.

[15] Amatur Rahman, et al. Representation of k-mer sets using spectrum-preserving string sets, 2020, bioRxiv.

[16] Phelim Bradley, et al. COBS: a Compact Bit-Sliced Signature Index, 2019, SPIRE.

[17] Andreas Andrusch, et al. DREAM-Yara: an exact read mapper for very large databases with short update time, 2018, Bioinform.

[18] Paul Medvedev, et al. Improved Representation of Sequence Bloom Trees, 2018, bioRxiv.

[19] Paul Medvedev, et al. On the representation of de Bruijn graphs, 2014, RECOMB.

[20] Paul Medvedev, et al. Compacting de Bruijn graphs from sequencing data quickly and in low memory, 2016, Bioinform.

[21] Michael A. Bender, et al. Mantis: A Fast, Small, and Exact Large-Scale Sequence-Search Index, 2018, Cell Systems.

[22] Dominique Lavenier, et al. DSK: k-mer counting with very low memory usage, 2013, Bioinform.

[23] Armando J. Pinho, et al. MFCompress: a compression tool for FASTA and multi-FASTA data, 2013, Bioinform.

[24] Daniel Gautheret, et al. REINDEER: efficient indexing of k-mer presence and abundance in sequencing datasets, 2020, bioRxiv.

[25] Sebastian Deorowicz, et al. KMC 3: counting and manipulating k-mer statistics, 2017, Bioinform.

[26] Paul Medvedev, et al. Data structures to represent sets of k-long DNA sequences, 2019, ArXiv.

[27] Michael A. Bender, et al. Squeakr: An Exact and Approximate k-mer Counting System, 2017, bioRxiv.

[28] Karel Brinda, et al. Novel computational techniques for mapping and classifying Next-Generation Sequencing data. (Nouvelles techniques informatiques pour la localisation et la classification de données de séquençage haut débit), 2016.

[29] Gil McVean, et al. Integrating long-range connectivity information into de Bruijn graphs, 2017, bioRxiv.

[30] Paul Medvedev, et al. Improved representation of sequence bloom trees, 2020, Bioinform.

[31] B. Berger, et al. Targeted Genotyping of Variable Number Tandem Repeats with AdVNTR, 2018, RECOMB.

[32] Kunihiko Sadakane, et al. Succinct de Bruijn Graphs, 2012, WABI.

[33] Chen Sun, et al. Toward fast and accurate SNP genotyping from whole genome sequencing data for bedside diagnostics, 2019, Bioinform.

[34] M. Schatz, et al. Algorithms Gage: a Critical Evaluation of Genome Assemblies and Assembly Material Supplemental, 2008.

[35] Hilde van der Togt, et al. Publisher's Note, 2003, J. Netw. Comput. Appl.

[36] Daniel S. Standage, et al. Kevlar: A Mapping-Free Framework for Accurate Discovery of De Novo Variants, 2019, bioRxiv.

[37] Carl Kingsford, et al. Improved Search of Large Transcriptomic Sequencing Databases Using Split Sequence Bloom Trees, 2016, bioRxiv.

[38] Phelim Bradley, et al. Ultra-fast search of all deposited bacterial and viral genomic data, 2019, Nature Biotechnology.

[39] Paola Bonizzoni, et al. MALVA: Genotyping by Mapping-free ALlele Detection of Known VAriants, 2019, iScience.

[40] Christina Boucher, et al. Data structures based on k-mers for querying large collections of sequencing data sets, 2019, bioRxiv.

[41] Gregory Kucherov, et al. Simplitigs as an efficient and scalable representation of de Bruijn graphs, 2020, Genome Biology.