Squeakr: An Exact and Approximate k-mer Counting System

Motivation: k-mer-based algorithms have become increasingly popular in the processing of high-throughput sequencing data. These algorithms span the gamut of the analysis pipeline from k-mer counting (e.g. for estimating assembly parameters), to error correction, genome and transcriptome assembly, and even transcript quantification. Yet, these tasks often use very different k-mer representations and data structures. In this article, we show how to build a k-mer-counting and multiset-representation system using the counting quotient filter, a feature-rich approximate membership query data structure. We introduce the k-mer-counting/querying system Squeakr (Simple Quotient filter-based Exact and Approximate Kmer Representation), which is based on the counting quotient filter. This off-the-shelf data structure turns out to be an efficient (approximate or exact) representation for sets or multisets of k-mers.

Results: Squeakr takes 2×–4.3× less time than the state-of-the-art to count and perform a random-point-query workload. Squeakr is memory-efficient, consuming 1.5×–4.3× less memory than the state-of-the-art. It offers competitive counting performance. In fact, it is faster for larger k-mers, and answers point queries (i.e. queries for the abundance of a particular k-mer) over an order of magnitude faster than other systems. The Squeakr representation of the k-mer multiset turns out to be immediately useful for downstream processing (e.g. de Bruijn graph traversal) because it supports fast queries and dynamic k-mer insertion, deletion, and modification.

Availability and implementation: https://github.com/splatlab/squeakr, available under the BSD 3-Clause License.

Contact: ppandey@cs.stonybrook.edu

Supplementary information: Supplementary data are available at Bioinformatics online.
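To make the operations named in the abstract concrete (counting, point queries for the abundance of a k-mer, and dynamic insertion and deletion), here is a minimal sketch of an exact k-mer multiset in Python. It is not Squeakr's implementation and does not use a counting quotient filter: a plain `collections.Counter` stands in for the filter, and the class name `KmerMultiset` and its methods are hypothetical names chosen for illustration. It also assumes k-mers are counted exactly as they appear in each read, and that windows containing characters other than A, C, G, T are skipped.

```python
from collections import Counter

VALID_BASES = set("ACGT")


class KmerMultiset:
    """Exact k-mer multiset; a Counter stands in for the counting quotient filter."""

    def __init__(self, k):
        self.k = k
        self.counts = Counter()

    def insert(self, kmer, amount=1):
        """Add `amount` occurrences of a single k-mer."""
        self.counts[kmer] += amount

    def remove(self, kmer, amount=1):
        """Remove up to `amount` occurrences; drop the entry when it reaches zero."""
        remaining = self.counts[kmer] - amount
        if remaining > 0:
            self.counts[kmer] = remaining
        else:
            self.counts.pop(kmer, None)

    def abundance(self, kmer):
        """Point query: how many times has this k-mer been inserted?"""
        return self.counts[kmer]  # Counter returns 0 for absent keys

    def add_read(self, read):
        """Slide a window of length k over the read and count each k-mer."""
        read = read.upper()
        for i in range(len(read) - self.k + 1):
            kmer = read[i:i + self.k]
            if set(kmer) <= VALID_BASES:  # assumption: skip windows with N etc.
                self.insert(kmer)


ms = KmerMultiset(k=4)
ms.add_read("ACGTACGT")
print(ms.abundance("ACGT"))  # 2
print(ms.abundance("TTTT"))  # 0
ms.remove("ACGT")
print(ms.abundance("ACGT"))  # 1
```

An approximate variant would store a fixed-width hash of each k-mer rather than the k-mer itself, trading a small false-positive rate for space; that trade-off is what the abstract's "approximate or exact" refers to, though the details of how Squeakr does it are not given here.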
