Probabilistic near-duplicate detection using simhash

This paper takes a new look at simhash, a dimensionality-reduction technique, for detecting similar document pairs in large-scale collections. We show that the algorithm produces intermediate data, normally discarded, that can be used to predict which bits of the final hash are most likely to flip between similar documents. This enables a probabilistic search technique in the Hamming space of simhashes that can be significantly faster and more space-efficient than existing simhash approaches. At 95% recall relative to the deterministic search of prior work, our method achieves 4-14 times faster lookup and requires 2-10 times less RAM on our collection of 70M web pages.
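To make the idea of "intermediate data" concrete, here is a minimal sketch of the textbook simhash construction. It is not the paper's implementation. It assumes unweighted tokens, a 64-bit fingerprint, and MD5 as the per-token hash, none of which are specified in the abstract. The function names `simhash_with_weights`, `least_stable_bits`, and `hamming` are hypothetical. The per-bit sums are the values usually thrown away after taking their signs. Treating a sum close to zero as a sign that the bit flips easily is this sketch's reading of the abstract, not a stated result.

```python
import hashlib

BITS = 64  # fingerprint width; an assumption of this sketch


def simhash_with_weights(tokens):
    """Return (fingerprint, per-bit sums) for an iterable of string tokens."""
    sums = [0] * BITS
    for tok in tokens:
        # 64-bit token hash taken from the first 8 bytes of MD5 (arbitrary choice).
        h = int.from_bytes(hashlib.md5(tok.encode("utf-8")).digest()[:8], "big")
        for i in range(BITS):
            # Each token votes +1 or -1 on every bit position.
            sums[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, s in enumerate(sums):
        if s > 0:  # the final bit is the sign of the accumulated vote
            fingerprint |= 1 << i
    return fingerprint, sums


def least_stable_bits(sums, k):
    """Indices of the k bits whose sums are closest to zero."""
    return sorted(range(len(sums)), key=lambda i: abs(sums[i]))[:k]


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

A probabilistic lookup in this spirit could try flipping the bits returned by `least_stable_bits` first, rather than enumerating every fingerprint within a fixed Hamming distance.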
