
Improving Bloom Filter Performance on Sequence Data Using k-mer Bloom Filters

Abstract: Using a sequence's k-mer content rather than the full sequence directly has enabled significant performance improvements in several sequencing applications, such as metagenomic species identification, estimation of transcript abundances, and alignment-free comparison of sequencing data. As k-mer sets often reach hundreds of millions of elements, traditional data structures are often impractical for k-mer set storage, and Bloom filters (BFs) and their variants are used instead. BFs reduce the memory footprint required to store millions of k-mers while allowing for fast set containment queries, at the cost of a low false positive rate (FPR). We show that, because k-mers are derived from sequencing reads, the information about k-mer overlap in the original sequence can be used to reduce the FPR up to 30× with little or no additional memory and with set containment queries that are only 1.3–1.6 times slower. Alternatively, we can leverage k-mer overlap information to store k-mer sets in about half the space while maintaining the original FPR. We consider several variants of such k-mer Bloom filters (kBFs), derive theoretical upper bounds for their FPR, and discuss their range of applications and limitations.
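The overlap idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class `BloomFilter`, the helpers `kmers`, `neighbors`, and `kbf_contains`, the SHA-256 double-hashing scheme, and the filter sizes are all assumptions made for illustration. The sketch stores k-mers in an ordinary Bloom filter and, at query time, accepts a k-mer only if at least one k-mer overlapping it by k-1 bases also tests positive, which is one simple way to use overlap information to reject false positives.

```python
import hashlib


class BloomFilter:
    """A plain Bloom filter over strings (illustrative, not tuned)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive all bit positions from two halves of one digest.
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item):
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(item))


def kmers(seq, k):
    """All length-k substrings of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def neighbors(kmer):
    """The eight k-mers that overlap kmer by k-1 bases on the left or right."""
    left = [base + kmer[:-1] for base in "ACGT"]
    right = [kmer[1:] + base for base in "ACGT"]
    return left + right


def kbf_contains(bf, kmer):
    """Overlap-aware query: kmer must be present AND have a present neighbor.

    Assumption: every stored k-mer came from a sequence longer than k,
    so each true k-mer has at least one true neighbor in the filter.
    """
    if kmer not in bf:
        return False
    return any(n in bf for n in neighbors(kmer))


# Hypothetical toy input; real k-mer sets hold millions of elements.
read = "ACGTACGTTAGC"
k = 4
bf = BloomFilter(num_bits=1 << 16, num_hashes=3)
for km in kmers(read, k):
    bf.add(km)
```

Under the stated assumption, every k-mer taken from `read` still passes `kbf_contains`, while a random false positive of the underlying filter is reported only if one of its eight neighbors also happens to test positive. The cost is up to eight extra filter lookups per positive query, which is in the spirit of the slower queries the abstract reports.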

Darya Filippova | Carl Kingsford | David Pellow

[1] Xiaolong Wu, et al. BLESS: Bloom filter-based error correction solution for high-throughput sequencing reads, 2014, Bioinform.

[2] Robert Patro, et al. Sailfish: Alignment-free Isoform Quantification from RNA-seq Reads using Lightweight Algorithms, 2013, ArXiv.

[3] Arend Hintze, et al. Scaling metagenome sequence assembly with probabilistic de Bruijn graphs, 2011, Proceedings of the National Academy of Sciences.

[4] B. Langmead, et al. Lighter: fast and memory-efficient sequencing error correction without counting, 2014, Genome Biology.

[5] Burton H. Bloom, et al. Space/time trade-offs in hash coding with allowable errors, 1970, CACM.

[6] Carl Kingsford, et al. Large-Scale Search of Transcriptomic Read Sets with Sequence Bloom Trees, 2015, bioRxiv.

[7] Gregory Kucherov, et al. Using cascading Bloom filters to improve the memory usage for de Brujin graphs, 2013, Algorithms for Molecular Biology.

[8] Eran Halperin, et al. Fast lossless compression via cascading Bloom filters, 2014, BMC Bioinformatics.

[9] Derrick E. Wood, et al. Kraken: ultrafast metagenomic sequence classification using exact alignments, 2014, Genome Biology.

[10] Bryan O'Sullivan, et al. Using Bloom Filters for Large Scale Gene Sequence Analysis in Haskell, 2009, PADL.

[11] Carl Kingsford, et al. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers, 2011, Bioinform.

[12] Björn Andersson, et al. Classification of DNA sequences using Bloom filters, 2010, Bioinform.

[13] Andrei Broder, et al. Network Applications of Bloom Filters: A Survey, 2004, Internet Math.

[14] Bonnie Berger, et al. Traversing the k-mer Landscape of NGS Read Datasets for Quality Score Sparsification, 2014, RECOMB.

[15] Darya Filippova, et al. Improving Bloom Filter Performance on Sequence Data Using k-mer Bloom Filters, 2016, RECOMB.

[16] Rob Patro, et al. Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms, 2013, Nature Biotechnology.

[17] E. Birney, et al. Velvet: algorithms for de novo short read assembly using de Bruijn graphs, 2008, Genome Research.

[18] Weiguo Liu, et al. Accelerating error correction in high-throughput short-read DNA sequencing data with CUDA, 2009, 2009 IEEE International Symposium on Parallel & Distributed Processing.

[19] Carl Kingsford, et al. Fast Search of Thousands of Short-Read Sequencing Experiments, 2015, Nature Biotechnology.

[20] Dominique Lavenier, et al. Reference-free compression of high throughput sequencing data with a probabilistic de Bruijn graph, 2015, BMC Bioinformatics.

[21] Jens Stoye, et al. Bloom Filter Trie - A Data Structure for Pan-Genome Storage, 2015, WABI.