CentralMatch: A Fast and Accurate Method to Identify Blog-Duplicates

A group of documents are called near-duplicates if they are almost identical, differing only slightly. Because near-duplicates are a major concern for Web search engines, they must be identified and filtered effectively. Among existing near-duplicate identification methods, MinHashing is the best known. It identifies near-duplicates regardless of where the differing parts are located in the two documents. In the blog environment, however, most near-duplicates differ only at their beginning or end. According to our preliminary experiment, about 99% of near-duplicates differ at the beginning or end (blog-duplicates hereafter), and only 1% differ in the middle. Thus, blog-duplicates share a long matched sequence in their central parts. Based on this observation, we present a novel algorithm, CentralMatch, that identifies blog-duplicates efficiently and accurately. When searching a document database for possible blog-duplicates of a given document, CentralMatch runs 50 times faster than MinHashing. In addition, CentralMatch identifies blog-duplicates more accurately than MinHashing. According to our experiments, when the precision of both MinHashing and CentralMatch is fixed at 0.9, their recalls are around 0.5 and 0.9, respectively, which means CentralMatch finds 80% more blog-duplicates than MinHashing.
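
To make the contrast in the abstract concrete, the following Python sketch places a basic MinHash similarity estimate next to a naive check for a long shared word sequence. It is only an illustration under stated assumptions: the abstract does not describe how CentralMatch works internally, so `longest_common_run` is a hypothetical stand-in for the "long matched sequence" observation and not the authors' algorithm. The shingle size, the number of hash functions, and all function names are likewise assumptions made for this example.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (k is an assumed parameter)."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}


def _hash(shingle, seed):
    """Seeded 64-bit hash; each seed plays the role of one random permutation."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(text, num_hashes=64, k=3):
    """MinHash signature: the minimum hash value of the shingle set per seed."""
    sh = shingles(text, k)
    return [min(_hash(s, seed) for s in sh) for seed in range(num_hashes)]


def minhash_similarity(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def longest_common_run(a_words, b_words):
    """Length of the longest contiguous word sequence shared by two word lists.

    Hypothetical illustration only: a simple O(len(a) * len(b)) dynamic program,
    not the CentralMatch algorithm.
    """
    best = 0
    prev = [0] * (len(b_words) + 1)
    for x in a_words:
        cur = [0] * (len(b_words) + 1)
        for j, y in enumerate(b_words, start=1):
            if x == y:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


# Two documents that share a body but differ at the beginning and at the end.
body = " ".join(f"w{i}" for i in range(50))
doc_a = "hello there " + body
doc_b = body + " bye now"

shared_run = longest_common_run(doc_a.split(), doc_b.split())
estimated_jaccard = minhash_similarity(minhash_signature(doc_a), minhash_signature(doc_b))
```

In this toy pair, the shared run covers the whole common body, while the MinHash estimate depends on the overall shingle overlap and does not use the position of the differing parts. The quadratic scan is chosen for clarity, not speed.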
