Constructing a text corpus for inexact duplicate detection

As online document collections continue to expand, both on the Web and in proprietary environments, the need for duplicate detection becomes more critical. The goal of this work is to facilitate (a) investigations into the phenomenon of near duplicates and (b) algorithmic approaches to minimizing its negative effect on search results. Harnessing the expertise of both client-users and professional searchers, we establish principled methods to generate a test collection for identifying and handling inexact duplicate documents.

Jack G. Conrad | Cindy P. Schriber
