Managing déjà vu: Collection building for the identification of nonidentical duplicate documents

As online document collections continue to expand, both on the Web and in proprietary environments, the need for duplicate detection becomes more critical. Few users wish to retrieve search results consisting of sets of duplicate documents, whether identical duplicates or close variants. The goal of this work is to facilitate (a) investigations into the phenomenon of near duplicates and (b) algorithmic approaches to minimizing its deleterious effect on search results. Harnessing the expertise of both client-users and professional searchers, we establish principled methods to generate a test collection for identifying and handling nonidentical duplicate documents. We subsequently examine a flexible method of characterizing and comparing documents to permit the identification of near duplicates. This method has produced promising results following an extensive evaluation using a production-based test collection created by domain experts. © 2006 Wiley Periodicals, Inc.
