IMPROVING MIN HASH VIA THE CONTAINMENT INDEX WITH APPLICATIONS TO METAGENOMIC ANALYSIS

Min hash is a probabilistic method for estimating the similarity of two sets in terms of their Jaccard index, defined as the ration of the size of their intersection to their union. We demonstrate that this method performs best when the sets under consideration are of similar size and the performance degrades considerably when the sets are of very different size. We introduce a new and efficient approach, called the containment min hash approach, that is more suitable for estimating the Jaccard index of sets of very different size. We accomplish this by leveraging another probabilistic method (in particular, Bloom filters) for fast membership queries. We derive bounds on the probability of estimate errors for the containment min hash approach and show it significantly improves upon the classical min hash approach. We also show significant improvements in terms of time and space complexity. As an application, we use this method to detect the presence/absence of organisms in a metagenomic data set, showing that it can detect the presence of very small, low abundance microorganisms.

[1]  Michael A. Bender,et al.  Don't Thrash: How to Cache Your Hash on Flash , 2011, Proc. VLDB Endow..

[2]  Abhinandan Das,et al.  Google news personalization: scalable online collaborative filtering , 2007, WWW '07.

[3]  Xiaolong Wu,et al.  BLESS: Bloom filter-based error correction solution for high-throughput sequencing reads , 2014, Bioinform..

[4]  Richard M. Karp,et al.  Faster and More Accurate Sequence Alignment with SNAP , 2011, ArXiv.

[5]  Burton H. Bloom,et al.  Space/time trade-offs in hash coding with allowable errors , 1970, CACM.

[6]  William Pugh,et al.  Skip Lists: A Probabilistic Alternative to Balanced Trees , 1989, WADS.

[7]  Páll Melsted,et al.  Efficient counting of k-mers in DNA sequences using a bloom filter , 2011, BMC Bioinformatics.

[8]  Edith Cohen,et al.  Finding interesting associations without support pruning , 2000, Proceedings of 16th International Conference on Data Engineering (Cat. No.00CB37073).

[9]  Eli Upfal,et al.  Probability and Computing: Randomized Algorithms and Probabilistic Analysis , 2005 .

[10]  P. Flajolet,et al.  HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm , 2007 .

[11]  Andrei Z. Broder,et al.  On the resemblance and containment of documents , 1997, Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No.97TB100171).

[12]  Arend Hintze,et al.  Scaling metagenome sequence assembly with probabilistic de Bruijn graphs , 2011, Proceedings of the National Academy of Sciences.

[13]  Sven Kosub,et al.  A note on the triangle inequality for the Jaccard distance , 2016, Pattern Recognit. Lett..

[14]  Carl Kingsford,et al.  Fast Search of Thousands of Short-Read Sequencing Experiments , 2015, Nature Biotechnology.

[15]  T. Thomas,et al.  GemSIM: general, error-model based simulator of next-generation sequencing data , 2012, BMC Genomics.

[16]  Bin Fan,et al.  Cuckoo Filter: Practically Better Than Bloom , 2014, CoNEXT.

[17]  Rayan Chikhi,et al.  Space-efficient and exact de Bruijn graph representation based on a Bloom filter , 2012, Algorithms for Molecular Biology.

[18]  Björn Andersson,et al.  Classification of DNA sequences using Bloom filters , 2010, Bioinform..

[19]  Alan M. Frieze,et al.  Min-Wise Independent Permutations , 2000, J. Comput. Syst. Sci..

[20]  Justin Chu,et al.  BioBloom tools: fast, accurate and memory-efficient host species sequence screening using bloom filters , 2014, Bioinform..

[21]  Brian D. Ondov,et al.  Mash: fast genome and metagenome distance estimation using MinHash , 2015, Genome Biology.

[22]  Evan Bolton,et al.  Database resources of the National Center for Biotechnology Information , 2017, Nucleic Acids Res..

[23]  Luiz Irber,et al.  sourmash: a library for MinHash sketching of DNA , 2016, J. Open Source Softw..

[24]  S. Tringe,et al.  Tackling soil diversity with the assembly of large, complex metagenomes , 2014, Proceedings of the National Academy of Sciences.