论文信息 - Efficient partial-duplicate detection based on sequence matching

Efficient partial-duplicate detection based on sequence matching

With the ever-increasing growth of the Internet, numerous copies of documents become serious problem for search engine, opinion mining and many other web applications. Since partial-duplicates only contain a small piece of text taken from other sources and most existing near-duplicate detection approaches focus on document level, partial duplicates can not be dealt with well. In this paper, we propose a novel algorithm to realize the partial-duplicate detection task. Besides the similarities between documents, our proposed algorithm can simultaneously locate the duplicated parts. The main idea is to divide the partial-duplicate detection task into two subtasks: sentence level near-duplicate detection and sequence matching. For evaluation, we compare the proposed method with other approaches on both English and Chinese web collections. Experimental results appear to support that our proposed method is effectively and efficiently to detect both partial-duplicates on large web collections.

[1] Andrei Z. Broder,et al. On the resemblance and containment of documents , 1997, Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No.97TB100171).

[2] Leslie G. Valiant,et al. A bridging model for parallel computation , 1990, CACM.

[3] Lynette Hirschman,et al. MITRE: Description of the Alembic System Used for MUC-6 , 1995, MUC.

[4] Joshua Alspector,et al. Improved robustness of signature-based near-replica detection via lexicon randomization , 2004, KDD.

[5] Piotr Indyk,et al. Approximate nearest neighbors: towards removing the curse of dimensionality , 1998, STOC '98.

[6] Ricardo Baeza-Yates,et al. Information Retrieval: Data Structures and Algorithms , 1992 .

[7] Andrei Z. Broder,et al. Identifying and Filtering Near-Duplicate Documents , 2000, CPM.

[8] Alexander Löser,et al. Near-duplicate detection for web-forums , 2009, IDEAS '09.

[9] Geoffrey Zweig,et al. Syntactic Clustering of the Web , 1997, Comput. Networks.

[10] Venkata Subramaniam,et al. Information Retrieval: Data Structures & Algorithms , 1992 .

[11] Hector Garcia-Molina,et al. Finding Near-Replicas of Documents and Servers on the Web , 1998, WebDB.

[12] Daniel Shawcross Wilkerson,et al. Winnowing: local algorithms for document fingerprinting , 2003, SIGMOD '03.

[13] Andreas Paepcke,et al. SpotSigs: robust and efficient near duplicate detection in large web collections , 2008, SIGIR '08.