HmSearch: an efficient Hamming distance query processing algorithm

Hamming distance measures the number of dimensions in which two vectors have different values. In applications such as pattern recognition, information retrieval, and databases, we often need to efficiently process Hamming distance queries, which retrieve the vectors in a database that are within Hamming distance k of a given query vector. Existing work on efficient Hamming distance query processing suffers from one or more of the following limitations: it is applicable only to very small error thresholds, it cannot handle vectors whose value domain is large, or it cannot attain robust performance in the presence of data skew. In this paper, we propose HmSearch, an efficient query processing method for Hamming distance queries that addresses these limitations. Our method is based on improved enumeration-based signatures, enhanced filtering, and hierarchical binary filtering and verification. We also design an effective dimension rearrangement method to deal with data skew. Extensive experimental results demonstrate that our methods outperform state-of-the-art methods by up to two orders of magnitude.
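To make the query definition concrete, the following minimal sketch computes the Hamming distance and answers a query by linear scan. It is only a brute-force baseline for illustration, not the HmSearch method; the function names `hamming_distance` and `hamming_query` and the toy data are assumptions introduced here.

```python
from typing import List, Sequence


def hamming_distance(u: Sequence, v: Sequence) -> int:
    """Count the dimensions in which u and v hold different values."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same number of dimensions")
    return sum(a != b for a, b in zip(u, v))


def hamming_query(database: List[Sequence], query: Sequence, k: int) -> List[int]:
    """Return the indices of all vectors within Hamming distance k of query.

    Linear scan: every vector is compared with the query, so the cost grows
    with the database size. Index-based methods aim to avoid this full scan.
    """
    return [i for i, vec in enumerate(database)
            if hamming_distance(vec, query) <= k]


# Hypothetical toy data: four 4-dimensional binary vectors.
database = [(0, 1, 1, 0), (1, 1, 1, 0), (1, 0, 0, 1), (0, 1, 1, 1)]
query = (0, 1, 1, 0)
matches = hamming_query(database, query, k=1)
print(matches)  # [0, 1, 3]
```

The comparison `a != b` works for any value domain, not only bits, so the same sketch applies to vectors over larger alphabets.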
