AllSome Sequence Bloom Trees

The ubiquity of next-generation sequencing has transformed the size and nature of many databases, pushing the boundaries of current indexing and searching methods. One particular example is a database of 2,652 human RNA-seq experiments uploaded to the Sequence Read Archive (SRA). Recently, Solomon and Kingsford proposed the Sequence Bloom Tree data structure and demonstrated how it can be used to accurately identify SRA samples in which a transcript of interest is potentially expressed. In this paper, we propose an improvement called the AllSome Sequence Bloom Tree. Results show that our new data structure significantly improves performance, reducing tree construction time by 52.7% and query time by 39–85%, at the cost of up to 3x higher memory consumption during queries. Notably, it can process a batch of 198,074 queries in under 8 h (compared to around two days previously) and query a whole set of \(k\)-mers from a sequencing experiment (about 27 million \(k\)-mers) in under 11 min.
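The abstract does not describe the data structure itself. As a rough illustration of the general Sequence Bloom Tree idea it builds on, the sketch below stores each experiment's \(k\)-mers in a leaf Bloom filter, makes each internal node the union of its children, and prunes any subtree in which fewer than a fraction `theta` of the query \(k\)-mers are found. This is a toy sketch, not the authors' implementation and not the AllSome variant; the class and function names, the filter size, and the single hash function are all assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass, field


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


class BloomFilter:
    """Toy Bloom filter; a Python int serves as the bit vector (assumed sizes)."""

    def __init__(self, num_bits=1 << 16, num_hashes=1, bits=0):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bits

    def _positions(self, item):
        # One salted BLAKE2b digest per hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=bytes([seed])
            ).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

    def union(self, other):
        # Bitwise OR of two filters with identical parameters.
        return BloomFilter(self.num_bits, self.num_hashes, self.bits | other.bits)


@dataclass
class Node:
    bloom: BloomFilter
    name: str = ""  # experiment identifier, set only on leaves
    children: list = field(default_factory=list)


def leaf(name, sequences, k):
    """Build a leaf whose filter holds every k-mer of one experiment."""
    bf = BloomFilter()
    for seq in sequences:
        for km in kmers(seq, k):
            bf.add(km)
    return Node(bf, name)


def parent(left, right):
    """Build an internal node whose filter is the union of its children."""
    return Node(left.bloom.union(right.bloom), children=[left, right])


def query(node, query_kmers, theta):
    """Return names of leaves holding at least theta of the query k-mers."""
    hits = sum(1 for km in query_kmers if km in node.bloom)
    if hits < theta * len(query_kmers):
        return []  # prune: no leaf below this node can reach the threshold
    if not node.children:
        return [node.name]
    found = []
    for child in node.children:
        found.extend(query(child, query_kmers, theta))
    return found


# Hypothetical two-experiment example.
exp_a = leaf("exp_a", ["ACGTACGTGGA"], k=4)
exp_b = leaf("exp_b", ["TTTTCCCCAAAA"], k=4)
root = parent(exp_a, exp_b)
matches = query(root, kmers("ACGTACGT", 4), theta=0.9)
```

Because Bloom filters can report false positives but never false negatives, a search of this kind may return extra experiments but will not miss one whose filter truly contains enough of the query \(k\)-mers.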
