Paper information - Detecting near-duplicates for web crawling

Detecting near-duplicates for web crawling

Near-duplicate web documents are abundant. Two such documents differ from each other in a very small portion that displays advertisements, for example. Such differences are irrelevant for web search. So the quality of a web crawler increases if it can assess whether a newly crawled web page is a near-duplicate of a previously crawled web page or not. In the course of developing a near-duplicate detection system for a multi-billion page repository, we make two research contributions. First, we demonstrate that Charikar's fingerprinting technique is appropriate for this goal. Second, we present an algorithmic technique for identifying existing f-bit fingerprints that differ from a given fingerprint in at most k bit-positions, for small k. Our technique is useful for both online queries (single fingerprints) and batch queries (multiple fingerprints). Experimental evaluation over real data confirms the practicality of our design.

[1] David A. Huffman,et al. A method for the construction of minimum-redundancy codes , 1952, Proceedings of the IRE.

[2] Andrew Chi-Chih Yao,et al. The complexity of searching an ordered random table , 1976, 17th Annual Symposium on Foundations of Computer Science (sfcs 1976).

[3] T. Landauer,et al. Indexing by Latent Semantic Analysis , 1990 .

[4] Brenda S. Baker,et al. A theory of parameterized pattern matching: algorithms and applications , 1993, STOC.

[5] F. Frances Yao,et al. Multi-index hashing for information retrieval , 1994, Proceedings 35th Annual Symposium on Foundations of Computer Science.

[6] Noam Nisan,et al. Neighborhood preserving hashing and approximate queries , 1994, SODA '94.

[7] Udi Manber,et al. Finding Similar Files in a Large File System , 1994, USENIX Winter.

[8] Brenda S. Baker,et al. On finding duplication and near-duplication in large software systems , 1995, Proceedings of 2nd Working Conference on Reverse Engineering.

[9] Hector Garcia-Molina,et al. SCAM: A Copy Detection Mechanism for Digital Documents , 1995, DL.

[10] Hector Garcia-Molina,et al. Copy detection mechanisms for digital documents , 1995, SIGMOD '95.

[11] Leszek Gasieniec,et al. Approximate Dictionary Queries , 1996, CPM.

[12] Andrei Z. Broder,et al. On the resemblance and containment of documents , 1997, Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No.97TB100171).

[13] Andrew Chi-Chih Yao,et al. Dictionary Look-Up with One Error , 1997, J. Algorithms.

[14] Geoffrey Zweig,et al. Syntactic Clustering of the Web , 1997, Comput. Networks.

[15] Sergey Brin,et al. The Anatomy of a Large-Scale Hypertextual Web Search Engine , 1998, Comput. Networks.

[16] Hector Garcia-Molina,et al. Efficient Crawling Through URL Ordering , 1998, Comput. Networks.

[17] Alan M. Frieze,et al. Min-wise independent permutations (extended abstract) , 1998, STOC '98.

[18] Jon M. Kleinberg. Authoritative sources in a hyperlinked environment , 1999 .

[19] Gerth Stølting Brodal,et al. Improved Bounds for Dictionary Look-up with One Error , 1999 .

[20] Andrei Z. Broder,et al. Mirror, Mirror on the Web: A Study of Host Pairs with Replicated Content , 1999, Comput. Networks.

[21] Monika Henzinger,et al. Finding Related Pages in the World Wide Web , 1999, Comput. Networks.

[22] Ravi Kumar,et al. Trawling the Web for Emerging Cyber-Communities , 1999, Comput. Networks.

[23] Andrei Z. Broder,et al. A Comparison of Techniques to Find Mirrored Hosts on the WWW , 2000, IEEE Data Eng. Bull..

[24] S. Muthukrishnan,et al. Selectivity estimation for Boolean queries , 2000, PODS '00.

[25] Alan M. Frieze,et al. Min-Wise Independent Permutations , 2000, J. Comput. Syst. Sci..

[26] Piotr Indyk,et al. Scalable Techniques for Clustering the Web , 2000, WebDB.

[27] Edith Cohen,et al. Finding interesting associations without support pruning , 2000, Proceedings of 16th International Conference on Data Engineering (Cat. No.00CB37073).

[28] Marco Gori,et al. Focused Crawling Using Context Graphs , 2000, VLDB.

[29] David Mazières,et al. A low-bandwidth network file system , 2001 .

[30] Sriram Raghavan,et al. Searching the Web , 2001, ACM Trans. Internet Techn..

[31] Dimitrios Gunopulos,et al. Efficient and tunable similar set retrieval , 2001, SIGMOD '01.

[32] Filippo Menczer,et al. Evaluating topic-driven web crawlers , 2001, SIGIR '01.

[33] Moses Charikar,et al. Similarity estimation techniques from rounding algorithms , 2002, STOC '02.

[34] James W. Cooper,et al. Detecting similar documents using salient terms , 2002, CIKM '02.

[35] Sean Quinlan,et al. Venti: A New Approach to Archival Storage , 2002, FAST.

[36] Ömer Egecioglu,et al. Dictionary Look-Up within Small Edit Distance , 2002, COCOON.

[37] Ophir Frieder,et al. Collection statistics for fast duplicate document detection , 2002, TOIS.

[38] Dan Klein,et al. Evaluating strategies for similarity search on the web , 2002, WWW '02.

[39] Sachindra Joshi,et al. A bag of paths model for measuring structural similarity in Web documents , 2003, KDD '03.

[40] Daniel Shawcross Wilkerson,et al. Winnowing: local algorithms for document fingerprinting , 2003, SIGMOD '03.

[41] Hector Garcia-Molina,et al. Extracting structured data from Web pages , 2003, SIGMOD '03.

[42] Justin Zobel,et al. Methods for Identifying Versioned and Plagiarized Documents , 2003, J. Assoc. Inf. Sci. Technol..

[43] Sanjay Ghemawat,et al. The Google file system , 2003 .

[44] Andrei Z. Broder,et al. Efficient URL caching for world wide web crawling , 2003, WWW '03.

[45] Mohamed S. Kamel,et al. Efficient phrase-based document indexing for Web document clustering , 2004, IEEE Transactions on Knowledge and Data Engineering.

[46] Chaomei Chen,et al. Mining the Web: Discovering knowledge from hypertext data , 2004, J. Assoc. Inf. Sci. Technol..

[47] Laurel Howe. Mirror , 2004 .

[48] Joshua Alspector,et al. Improved robustness of signature-based near-replica detection via lexicon randomization , 2004, KDD.

[49] Sanjay Ghemawat,et al. MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[50] Jack G. Conrad,et al. Constructing a text corpus for inexact duplicate detection , 2004, SIGIR '04.

[51] Sandeep Pandey,et al. User-centric Web crawling , 2005, WWW '05.

[52] Vladik Kreinovich,et al. Mining the Web: Discovering Knowledge from Hypertext Data, by Soumen Chakrabarti and Morgan Kaufmann , 2005, J. Intell. Fuzzy Syst..

[53] Monika Henzinger,et al. Finding near-duplicate web pages: a large-scale evaluation of algorithms , 2006, SIGIR.