Managing déjà vu: Collection building for the identification of nonidentical duplicate documents

As online document collections continue to expand, both on the Web and in proprietary environments, the need for duplicate detection becomes more critical. Few users wish to retrieve search results consisting of sets of duplicate documents, whether identical duplicates or close variants. The goal of this work is to facilitate (a) investigations into the phenomenon of near duplicates and (b) algorithmic approaches to minimizing its deleterious effect on search results. Harnessing the expertise of both client-users and professional searchers, we establish principled methods to generate a test collection for identifying and handling nonidentical duplicate documents. We subsequently examine a flexible method of characterizing and comparing documents to permit the identification of near duplicates. This method has produced promising results following an extensive evaluation using a production-based test collection created by domain experts.
