kmcEx: memory-frugal and retrieval-efficient encoding of counted k-mers

MOTIVATION K-mers together with their frequencies are an elementary building block for error correction, repeat detection, multiple sequence alignment, genome assembly, and other analyses, which has motivated intensive study of k-mer counting. However, the output of a k-mer counter is itself large; very often it is too large to fit into main memory, which severely limits its usability.

METHODS AND RESULTS We introduce a novel way of encoding k-mers together with their frequencies that achieves good memory savings and retrieval efficiency. Specifically, we propose a Bloom filter-like data structure that encodes counted k-mers in coupled bit arrays: one for k-mer representation and the other for frequency encoding. Experiments on five real data sets show that, with 7 hash functions, the average memory-saving ratio over all 31-mers is as high as 13.81 compared with the raw input. At the same time, the retrieval time complexity is well controlled (effectively constant), and the false-positive rate is decreased by two orders of magnitude.

AVAILABILITY The source code of our algorithm is available at github.com/lzhLab/kmcEx.
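To make the idea of pairing a membership array with a frequency array concrete, here is a minimal Python sketch. It is not the kmcEx encoding, whose details the abstract does not give; the authoritative implementation is the repository named above. The sketch swaps in a plain Bloom filter whose slots are coupled with a counter array: each slot stores the largest count written to it, and a query returns the minimum over a k-mer's slots. The class name `CoupledCountFilter`, the helper `count_kmers`, the double-hashing scheme, and the slot count are all assumptions made for illustration; only the choice of 7 hash functions comes from the abstract.

```python
import hashlib
from collections import Counter


def count_kmers(sequence, k):
    """Count every k-mer in a sequence (stand-in for a real k-mer counter)."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


class CoupledCountFilter:
    """Toy Bloom filter whose presence array is paired with a counter array.

    Hypothetical illustration only; not the data structure used by kmcEx.
    """

    def __init__(self, num_slots, num_hashes=7):
        self.num_slots = num_slots
        self.num_hashes = num_hashes
        self.present = bytearray(num_slots)  # one byte per slot, used as a bit
        self.counts = [0] * num_slots        # counter coupled to the same slot

    def _positions(self, kmer):
        # Double hashing: derive all slot indices from one SHA-256 digest.
        digest = hashlib.sha256(kmer.encode("ascii")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_slots for i in range(self.num_hashes)]

    def add(self, kmer, count):
        """Insert a k-mer once, together with its already-known count."""
        for pos in self._positions(kmer):
            self.present[pos] = 1
            # Keep the largest count seen at this slot, so a stored k-mer's
            # count is never underestimated when slots are shared.
            self.counts[pos] = max(self.counts[pos], count)

    def query(self, kmer):
        """Return 0 if the k-mer is definitely absent, else a count estimate."""
        positions = self._positions(kmer)
        if not all(self.present[pos] for pos in positions):
            return 0
        return min(self.counts[pos] for pos in positions)


# Usage: encode the counted 4-mers of a short read, then look one up.
encoded = CoupledCountFilter(num_slots=1024)
for kmer, count in count_kmers("ACGTACGTAC", 4).items():
    encoded.add(kmer, count)
print(encoded.query("ACGT"))
```

Each lookup touches a fixed number of slots, which is why retrieval time in a structure of this kind does not grow with the number of stored k-mers. As with any Bloom filter, a k-mer that was never inserted can still be reported as present, and in this sketch a stored k-mer's count can be overestimated when it shares all of its slots with k-mers of higher frequency.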

Peng Jiang | Liang Zhao | Limsoon Wong | Bertil Schmidt | Jie Luo | Ningjiang Chen | Xiangjun Tang | Yiqi Wang | Pingji Deng
