AllSome Sequence Bloom Trees

The ubiquity of next generation sequencing has transformed the size and nature of many databases, pushing the boundaries of current indexing and searching methods. One particular example is a database of 2,652 human RNA-seq experiments uploaded to the Sequence Read Archive. Recently, Solomon and Kingsford proposed the Sequence Bloom Tree data structure and demonstrated how it can be used to accurately identify SRA samples that have a transcript of interest potentially expressed. In this paper, we propose an improvement called the AllSome Sequence Bloom Tree. Results show that our new data structure significantly improves performance, reducing the tree construction time by 52.7% and query time by 39 - 85%, with a price of up to 3x memory consumption during queries. Notably, it can query a batch of 198,074 queries in under 8 hours (compared to around two days previously) and a whole set of k-mers from a sequencing experiment (about 27 mil k-mers) in under 11 minutes.

[1]  Gregory Kucherov,et al.  Using cascading Bloom filters to improve the memory usage for de Brujin graphs , 2013, Algorithms for Molecular Biology.

[2]  Michael Mitzenmacher,et al.  Less hashing, same performance: Building a better Bloom filter , 2006, Random Struct. Algorithms.

[3]  Rob Patro,et al.  Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms , 2013, Nature Biotechnology.

[4]  Leonardo Collado-Torres,et al.  Rail-RNA: Scalable analysis of RNA-seq splicing and coverage , 2015, bioRxiv.

[5]  Hideaki Sugawara,et al.  The Sequence Read Archive , 2010, Nucleic Acids Res..

[6]  Zamin Iqbal,et al.  Using reference-free compressed data structures to analyze sequencing reads from thousands of human genomes , 2016, bioRxiv.

[7]  Carl Kingsford,et al.  A fast, lock-free approach for efficient parallel counting of occurrences of k-mers , 2011, Bioinform..

[8]  Páll Melsted,et al.  Efficient counting of k-mers in DNA sequences using a bloom filter , 2011, BMC Bioinformatics.

[9]  B. Berger,et al.  Compressive genomics , 2012, Nature Biotechnology.

[10]  Ricardo A. Baeza-Yates,et al.  Adding Compression to Block Addressing Inverted Indexes , 2000, Information Retrieval.

[11]  Chen Sun,et al.  AllSome Sequence Bloom Trees , 2018, J. Comput. Biol..

[12]  Jens Stoye,et al.  Bloom Filter Trie - A Data Structure for Pan-Genome Storage , 2015, WABI.

[13]  Neil D. Rawlings,et al.  New mini- zincin structures provide a minimal scaffold for members of this metallopeptidase superfamily , 2014, BMC Bioinformatics.

[14]  Heng Li Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM , 2013, 1303.3997.

[15]  Rayan Chikhi,et al.  Space-efficient and exact de Bruijn graph representation based on a Bloom filter , 2012, Algorithms for Molecular Biology.

[16]  Lior Pachter,et al.  Near-optimal probabilistic RNA-seq quantification , 2016, Nature Biotechnology.

[17]  Cheng Soon Ong,et al.  kWIP: The k-mer weighted inner product, a de novo estimator of genetic similarity , 2016, bioRxiv.

[18]  Rajeev Raman,et al.  Succinct indexable dictionaries with applications to encoding k-ary trees and multisets , 2002, SODA '02.

[19]  Eran Halperin,et al.  Fast lossless compression via cascading Bloom filters , 2014, BMC Bioinformatics.

[20]  Pierre Peterlongo,et al.  A Resource-frugal Probabilistic Dictionary and Applications in (Meta)Genomics , 2016, Stringology.

[21]  Yadong Wang,et al.  deBWT: parallel construction of Burrows–Wheeler Transform for large collection of genomes with de Bruijn-branch encoding , 2016, Bioinform..

[22]  Steven L Salzberg,et al.  Fast gapped-read alignment with Bowtie 2 , 2012, Nature Methods.

[23]  Enno Ohlebusch,et al.  Graphical pan-genome analysis with compressed suffix trees and the Burrows-Wheeler transform , 2016, Bioinform..

[24]  The Computational Pan-Genomics Consortium,et al.  Computational pan-genomics: status, promises and challenges , 2018, Briefings Bioinform..

[25]  Alistair Moffat,et al.  From Theory to Practice: Plug and Play with Succinct Data Structures , 2013, SEA.

[26]  Daniel Lemire,et al.  Bloofi: Multidimensional Bloom filters , 2015, Inf. Syst..

[27]  Burton H. Bloom,et al.  Space/time trade-offs in hash coding with allowable errors , 1970, CACM.

[28]  Ricardo A. Baeza-Yates,et al.  Compression: A Key for Next-Generation Text Retrieval Systems , 2000, Computer.

[29]  Xiaolong Wu,et al.  BLESS: Bloom filter-based error correction solution for high-throughput sequencing reads , 2014, Bioinform..

[30]  Björn Andersson,et al.  Classification of DNA sequences using Bloom filters , 2010, Bioinform..

[31]  S Miyano,et al.  Open source clustering software. , 2004, Bioinformatics.

[32]  Sven Rahmann,et al.  PanCake: A Data Structure for Pangenomes , 2013, GCB.

[33]  Michael C. Schatz,et al.  SplitMEM: a graphical algorithm for pan-genome analysis with suffix skips , 2014, Bioinform..

[34]  Paul Medvedev,et al.  TwoPaCo: an efficient algorithm to build the compacted de Bruijn graph from many complete genomes , 2016, Bioinform..

[35]  Ying Zhang,et al.  Computational pan-genomics: status, promises and challenges , 2016, bioRxiv.

[36]  Robert Patro,et al.  Sailfish: Alignment-free Isoform Quantification from RNA-seq Reads using Lightweight Algorithms , 2013, ArXiv.

[37]  Carl Kingsford,et al.  Fast Search of Thousands of Short-Read Sequencing Experiments , 2015, Nature Biotechnology.

[38]  Owen Kaser,et al.  Better bitmap performance with Roaring bitmaps , 2014, Softw. Pract. Exp..

[39]  Bonnie Berger,et al.  Entropy-scaling search of massive biological data , 2015, Cell systems.

[40]  David R. Kelley,et al.  Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks , 2012, Nature Protocols.