Paper information - Managing déjà vu: Collection building for the identification of nonidentical duplicate documents

As online document collections continue to expand, both on the Web and in proprietary environments, the need for duplicate detection becomes more critical. Few users wish to retrieve search results consisting of sets of duplicate documents, whether identical duplicates or close variants. The goal of this work is to facilitate (a) investigations into the phenomenon of near duplicates and (b) algorithmic approaches to minimizing its deleterious effect on search results. Harnessing the expertise of both client-users and professional searchers, we establish principled methods to generate a test collection for identifying and handling nonidentical duplicate documents. We subsequently examine a flexible method of characterizing and comparing documents to permit the identification of near duplicates. This method has produced promising results following an extensive evaluation using a production-based test collection created by domain experts.

Jack G. Conrad | Cindy P. Schriber
