KmerEstimate: A Streaming Algorithm for Estimating k-mer Counts with Optimal Space Usage

The frequency distribution of k-mers (substrings of length k in a DNA/RNA sequence) is very useful for many bioinformatics applications that use next-generation sequencing (NGS) data. Examples include de Bruijn graph based assembly, read error correction, genome size prediction, and digital normalization. In developing tools for such applications, counting (or estimating) k-mers with low frequency is a pre-processing phase. However, computing the k-mer frequency histogram becomes computationally challenging for large-scale genomic data. We present KmerEstimate, a streaming algorithm that approximates the count of k-mers with a given frequency in a genomic data set. Our algorithm is based on a well-known adaptive sampling streaming algorithm due to Bar-Yossef et al. for approximating the number of distinct elements in a data stream. We implemented and tested our algorithm on several data sets. Its results are better than those of other streaming approaches used so far for this problem (notably ntCard, the state-of-the-art streaming approach) and are within a 0.6% error rate. It uses less memory than ntCard, as its sample size is almost 85% smaller than that of ntCard. In addition, our algorithm has provable approximation and space usage guarantees. We also show certain space complexity lower bounds. The source code of our algorithm is available at https://github.com/srbehera11/KmerEstimate.
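To make the adaptive sampling idea concrete, the following is a minimal Python sketch, not the KmerEstimate implementation. It assumes a simplified scheme: a k-mer is kept in the sample only while its hash has at least `level` trailing zero bits, and `level` is raised whenever the sample outgrows `capacity`. The function names, the `capacity` parameter, and the use of `hashlib.blake2b` as the hash are illustrative assumptions, and reverse-complement (canonical) k-mers are ignored for brevity.

```python
import hashlib
from collections import Counter


def kmers(seq, k):
    """Yield every length-k substring of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def hash64(kmer):
    """Deterministic 64-bit hash (assumption: a stand-in for a faster rolling hash)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def trailing_zeros(x, width=64):
    """Number of trailing zero bits of x; defined as width when x == 0."""
    return width if x == 0 else (x & -x).bit_length() - 1


def estimate_histogram(reads, k, capacity):
    """Estimate f_i (number of distinct k-mers occurring exactly i times) and F0.

    A k-mer stays in the sample only while its hash has at least `level`
    trailing zero bits, so each distinct k-mer survives with probability
    2**-level. Counts of surviving k-mers are exact, because a surviving
    k-mer also qualified at every earlier (lower) level.
    """
    level = 0
    sample = {}  # kmer -> [exact count, trailing zeros of its hash]
    for read in reads:
        for km in kmers(read, k):
            if km in sample:
                sample[km][0] += 1
                continue
            tz = trailing_zeros(hash64(km))
            if tz < level:
                continue  # not sampled at the current level
            sample[km] = [1, tz]
            # Shrink the sample by raising the level until it fits again.
            while len(sample) > capacity:
                level += 1
                sample = {key: v for key, v in sample.items() if v[1] >= level}
    scale = 2 ** level
    by_count = Counter(count for count, _ in sample.values())
    hist = {i: n * scale for i, n in by_count.items()}
    return hist, len(sample) * scale
```

When `capacity` is at least the number of distinct k-mers, `level` never rises and the result is exact: `estimate_histogram(["ACGTACGT"], 4, 100)` returns `({1: 3, 2: 1}, 4)`. With a smaller `capacity`, each sampled count is scaled by `2 ** level`, which trades accuracy for memory.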
