Detecting near-duplicates for web crawling

Near-duplicate web documents are abundant. Two such documents differ from each other only in a very small portion, for example the part that displays advertisements. Such differences are irrelevant for web search, so the quality of a web crawler increases if it can assess whether a newly crawled web page is a near-duplicate of a previously crawled web page. In the course of developing a near-duplicate detection system for a multi-billion page repository, we make two research contributions. First, we demonstrate that Charikar's fingerprinting technique is appropriate for this goal. Second, we present an algorithmic technique for identifying existing f-bit fingerprints that differ from a given fingerprint in at most k bit-positions, for small k. Our technique is useful for both online queries (single fingerprints) and batch queries (multiple fingerprints). Experimental evaluation over real data confirms the practicality of our design.
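
The two ideas named in the abstract can be illustrated with a minimal Python sketch. It is not the system described here. It assumes that Charikar's technique is applied in its usual simhash form to weighted text features, and it uses f = 64, k = 3, SHA-256 as the per-feature hash, and the function names `simhash`, `hamming_distance`, and `near_duplicates`, all of which are choices made for this example only. The search is a naive linear scan, shown only to make the problem statement concrete (find stored f-bit fingerprints within k bit-positions of a query). It is not the algorithmic technique this work contributes.

```python
import hashlib

F = 64  # fingerprint width in bits (assumed for this sketch)


def simhash(features, f=F):
    """Compute an f-bit simhash-style fingerprint.

    `features` is an iterable of (token, weight) pairs. `f` must be a
    multiple of 8 and at most 256, because the per-token hash is taken
    from the leading bytes of a SHA-256 digest (an illustrative choice).
    """
    counts = [0] * f
    for token, weight in features:
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[: f // 8], "big")
        for i in range(f):
            # Each feature votes +weight or -weight on every bit position.
            if (h >> i) & 1:
                counts[i] += weight
            else:
                counts[i] -= weight
    fingerprint = 0
    for i in range(f):
        # The sign of the accumulated vote decides the fingerprint bit.
        if counts[i] > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit-positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def near_duplicates(query, fingerprints, k=3):
    """Naive baseline: scan every stored fingerprint and keep those that
    differ from `query` in at most k bit-positions."""
    return [fp for fp in fingerprints if hamming_distance(query, fp) <= k]
```

A call such as `near_duplicates(simhash(page_features), stored_fingerprints)` answers a single online query. The scan costs time linear in the number of stored fingerprints, which is exactly why a faster lookup structure is needed at the scale of a multi-billion page repository.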
