Paper Information - CentralMatch: A Fast and Accurate Method to Identify Blog-Duplicates


Documents are called near-duplicates if they are almost identical, differing only slightly. Because near-duplicates are a major concern for Web search engines, they must be identified and filtered effectively. Among existing near-duplicate identification methods, MinHashing is the best known. It identifies near-duplicates regardless of where the differing parts are located in the two documents. In the blog environment, however, most near-duplicates differ only at their beginning or end. According to our preliminary experiment, about 99% of near-duplicates differ at the beginning or end (blog-duplicates hereafter) and only 1% differ in the middle. Thus, blog-duplicates share a long matched sequence in their central parts. Based on this observation, we present a novel algorithm, CentralMatch, to identify blog-duplicates efficiently and accurately. When searching a document database for possible blog-duplicates of a given document, CentralMatch runs 50 times faster than MinHashing. In addition, CentralMatch identifies blog-duplicates more accurately than MinHashing. According to our experiments, when the precisions of MinHashing and CentralMatch are fixed at 0.9, their recalls are around 0.5 and 0.9, respectively, which means CentralMatch finds 80% more blog-duplicates than MinHashing.
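The central observation, that blog-duplicates share one long matched run in the middle and differ only at the edges, can be illustrated with a small sketch. This is not the CentralMatch algorithm itself, whose details the abstract does not give. It is a naive pairwise check built on Python's standard `difflib`, and the function names, the word-level tokenization, and the 0.8 threshold are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def longest_match_ratio(doc_a: str, doc_b: str) -> float:
    """Share of the shorter document covered by the longest common word run.

    Hypothetical helper: it illustrates the "long matched central sequence"
    observation and is not the method from the paper.
    """
    words_a, words_b = doc_a.split(), doc_b.split()
    if not words_a or not words_b:
        return 0.0
    # autojunk=False so that frequent words are not silently ignored.
    matcher = SequenceMatcher(None, words_a, words_b, autojunk=False)
    match = matcher.find_longest_match(0, len(words_a), 0, len(words_b))
    return match.size / min(len(words_a), len(words_b))


def looks_like_blog_duplicate(doc_a: str, doc_b: str, threshold: float = 0.8) -> bool:
    """Flag a pair whose longest shared run covers most of the shorter text.

    The threshold value is an assumption chosen for this example.
    """
    return longest_match_ratio(doc_a, doc_b) >= threshold


if __name__ == "__main__":
    body = " ".join(f"w{i}" for i in range(20))
    with_header = "intro added " + body        # differs only at the beginning
    with_footer = body + " trailing footer"    # differs only at the end
    print(looks_like_blog_duplicate(with_header, with_footer))
```

A pair that differs only at its beginning or end keeps a single long run and passes the check, while an edit in the middle splits the run in two and lowers the ratio. Comparing every pair this way would be slow on a large collection, so the sketch says nothing about how the reported speedup over MinHashing is achieved.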

Sang-Chul Lee | Sang-Wook Kim | Heejin Park | Soon-Haeng Lee
