Paper information - Squeakr: An Exact and Approximate k-mer Counting System

Motivation: k-mer-based algorithms have become increasingly popular in the processing of high-throughput sequencing data. These algorithms span the gamut of the analysis pipeline from k-mer counting (e.g. for estimating assembly parameters), to error correction, genome and transcriptome assembly, and even transcript quantification. Yet, these tasks often use very different k-mer representations and data structures. In this article, we show how to build a k-mer-counting and multiset-representation system using the counting quotient filter, a feature-rich approximate membership query data structure. We introduce the k-mer-counting/querying system Squeakr (Simple Quotient filter-based Exact and Approximate Kmer Representation), which is based on the counting quotient filter. This off-the-shelf data structure turns out to be an efficient (approximate or exact) representation for sets or multisets of k-mers.

Results: Squeakr takes 2× to 4.3× less time than the state-of-the-art to count and perform a random-point-query workload. Squeakr is memory-efficient, consuming 1.5× to 4.3× less memory than the state-of-the-art. It offers competitive counting performance. In fact, it is faster for larger k-mers, and answers point queries (i.e. queries for the abundance of a particular k-mer) over an order of magnitude faster than other systems. The Squeakr representation of the k-mer multiset turns out to be immediately useful for downstream processing (e.g. de Bruijn graph traversal) because it supports fast queries and dynamic k-mer insertion, deletion, and modification.

Availability and implementation: https://github.com/splatlab/squeakr, available under the BSD 3-Clause License.

Contact: ppandey@cs.stonybrook.edu

Supplementary information: Supplementary data are available at Bioinformatics online.
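To make the abstract's terms concrete, the following minimal sketch shows what "k-mer counting" and a "point query" mean on a toy input. It is not Squeakr's implementation: a plain Python `Counter` stands in for the counting quotient filter, so it is exact and far less space-efficient. The function names `canonical`, `count_kmers`, and `point_query` are hypothetical, and counting a k-mer together with its reverse complement (the canonical form) is a common convention assumed here, not something the abstract states.

```python
from collections import Counter

# Translation table for complementing DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement.

    Assumption: strands are treated as equivalent (a common convention).
    """
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_kmers(reads, k):
    """Build an exact multiset of canonical k-mers from an iterable of reads.

    A Counter is used as a stand-in for the counting quotient filter.
    """
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Skip windows containing N or any other non-ACGT symbol.
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    return counts


def point_query(counts, kmer):
    """Return the abundance of a single k-mer (0 if it was never seen)."""
    return counts[canonical(kmer)]
```

With the reads `["ACGTAC", "GTACGT"]` and `k = 3`, `point_query(count_kmers(reads, 3), "CGT")` returns 4, because `CGT` and its reverse complement `ACG` each occur twice and are counted under one canonical key. Because the multiset is an ordinary mutable mapping, counts can also be incremented, decremented, or removed after construction, which loosely mirrors the dynamic insertion, deletion, and modification the abstract attributes to Squeakr.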
