Large scale Hamming distance query processing

Hamming distance is widely used in many application domains, such as near-duplicate detection and pattern recognition. We study Hamming distance range query problems, where the goal is to find all strings in a database that are within a Hamming distance bound k of a query string. If k is fixed, the problem is a static Hamming distance range query problem. If k is part of the input, the problem is a dynamic Hamming distance range query problem. For the static problem, the prior art consumes a large amount of memory because it aggressively replicates the database. For the dynamic problem, as far as we know, there is no space- and time-efficient solution for arbitrary databases. In this paper, we first propose a static Hamming distance range query algorithm called HEngines, which addresses the space issue in the prior art by dynamically expanding the query on the fly. We then propose a dynamic Hamming distance range query algorithm called HEngined, which addresses the limitation of the prior art using a divide-and-conquer strategy. We implemented our algorithms and conducted side-by-side comparisons on large real-world and synthetic datasets. In our experiments, HEngines uses 4.65 times less space and processes queries 16% faster than the prior art, and HEngined processes queries 46 times faster than linear scan while using only 1.7 times more space.
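To make the problem statement concrete, the following is a minimal sketch of a Hamming distance range query answered by linear scan, the baseline the experiments compare against. It is not HEngines or HEngined. The function names, the encoding of strings as Python integers, and the sample data are assumptions made for illustration only.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two equal-length bit strings differ."""
    # XOR leaves a 1 in exactly the positions where a and b disagree.
    return bin(a ^ b).count("1")


def range_query(database: list[int], query: int, k: int) -> list[int]:
    """Return every database string within Hamming distance k of the query.

    This is a plain linear scan: each string is compared with the query once.
    Here k is passed per call, which corresponds to the dynamic setting; in
    the static setting the same k would be fixed ahead of time.
    """
    return [s for s in database if hamming_distance(s, query) <= k]


# Hypothetical 8-bit fingerprints, chosen only for the example.
database = [0b10110010, 0b10110011, 0b01001101, 0b10100010]
query = 0b10110010

matches = range_query(database, query, k=1)
print([format(s, "08b") for s in matches])
```

With k = 1 the scan returns the exact match and the two strings that differ from the query in a single bit, and it skips the string that differs in all eight bits. The cost grows linearly with the database size, which is the overhead that an index-based approach aims to avoid.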
