kmcEx: memory-frugal and retrieval-efficient encoding of counted k-mers

MOTIVATION: K-mers and their frequencies are an elementary building block for error correction, repeat detection, multiple sequence alignment, genome assembly, and related tasks, which has driven intensive study of k-mer counting. However, the output of a k-mer counter is itself large; very often it is too large to fit into main memory, which severely limits its usability.

METHODS AND RESULTS: We introduce a novel way of encoding k-mers together with their frequencies that achieves good memory savings and retrieval efficiency. Specifically, we propose a Bloom filter-like data structure that encodes counted k-mers in a pair of coupled bit arrays: one for k-mer representation and the other for frequency encoding. Experiments on five real data sets show that, with 7 hash functions, the average memory-saving ratio on all 31-mers is as high as 13.81 compared with the raw input. At the same time, the retrieval time complexity is well controlled (effectively constant), and the false-positive rate is decreased by two orders of magnitude.

AVAILABILITY: The source code of our algorithm is available at github.com/lzhLab/kmcEx.
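To make the idea of coupling a membership array with a frequency array concrete, here is a minimal Python sketch. It is not the kmcEx encoding: the class name `CountedKmerSketch`, the one-byte counter slots, the cap at 255, the double-hashing scheme, and the max-on-insert / min-on-query rule are all assumptions chosen for illustration. Only the default of 7 hash functions is taken from the abstract.

```python
import hashlib

MAX_COUNT = 255  # assumption: one byte per frequency slot, larger counts are capped


class CountedKmerSketch:
    """Toy Bloom filter-like store for (k-mer, count) pairs; illustrative only."""

    def __init__(self, num_slots: int, num_hashes: int = 7):
        self.num_slots = num_slots
        self.num_hashes = num_hashes
        self.present = bytearray((num_slots + 7) // 8)  # membership: 1 bit per slot
        self.counts = bytearray(num_slots)              # frequency: 1 byte per slot

    def _positions(self, kmer: str) -> list[int]:
        # Double hashing: derive all slot indices from one 16-byte digest.
        digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:], "little") | 1  # force a non-zero step
        return [(h1 + i * h2) % self.num_slots for i in range(self.num_hashes)]

    def add(self, kmer: str, count: int) -> None:
        count = min(count, MAX_COUNT)
        for pos in self._positions(kmer):
            self.present[pos >> 3] |= 1 << (pos & 7)  # set the membership bit
            if count > self.counts[pos]:              # keep the largest count seen
                self.counts[pos] = count

    def query(self, kmer: str) -> int:
        positions = self._positions(kmer)
        # Any unset membership bit proves the k-mer was never added.
        if not all((self.present[p >> 3] >> (p & 7)) & 1 for p in positions):
            return 0
        # Collisions can only raise a slot, so the minimum is the tightest bound.
        return min(self.counts[p] for p in positions)


sketch = CountedKmerSketch(num_slots=1 << 16)
sketch.add("ACGTACGTACGTACGTACGTACGTACGTACG", 5)  # a 31-mer with count 5
print(sketch.query("ACGTACGTACGTACGTACGTACGTACGTACG"))
```

In this simplified scheme a lookup touches a fixed number of slots regardless of how many k-mers are stored, which is the sense in which retrieval is effectively constant time. A stored k-mer is never reported as absent, and its reported count is never below the (capped) count that was inserted; collisions can cause an absent k-mer to be reported as present or a count to be overestimated.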
