Paper information - Near duplicate detection in an academic digital library

The detection and potential removal of duplicates is desirable for a number of reasons, such as reducing unnecessary storage and computation and providing users with uncluttered search results. This paper investigates the application of scalable, state-of-the-art simhash and shingle-based duplicate detection algorithms to detecting near duplicate documents in the CiteSeerX digital library. We empirically explored the duplicate detection methods, evaluated their performance and applicability to academic documents, and identified good parameters for the algorithms. We also analyzed the types of near duplicates identified by each algorithm. The highest F-scores achieved were 0.91 for the simhash-based method and 0.99 for the shingle-based method. The shingle-based method also identified a larger variety of duplicate types than the simhash-based method.
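As a rough illustration of the two families of methods named in the abstract, the following minimal sketch computes a shingle-based Jaccard similarity and a simhash fingerprint with Hamming distance. It is not the paper's implementation: the function names, the shingle width `w=3`, the 64-bit fingerprint size, and the use of MD5 as the token hash are all assumptions chosen for illustration, and the parameters the paper found to work well are not reproduced here.

```python
import hashlib
import re


def shingles(text, w=3):
    """Return the set of w-word shingles (contiguous word n-grams) in text.

    The width w=3 is an illustrative assumption, not a value from the paper.
    """
    words = re.findall(r"\w+", text.lower())
    if len(words) < w:
        # Text shorter than one shingle: treat the whole text as one shingle.
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + w]) for i in range(len(words) - w + 1)}


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def simhash(text, bits=64):
    """Compute a simhash fingerprint of text as an integer with `bits` bits.

    Each token is hashed (MD5 here, an assumption); every bit position
    gets +1 if the token's hash has that bit set and -1 otherwise. The
    fingerprint keeps the bits whose running total is positive.
    """
    counts = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if counts[i] > 0)


def hamming(x, y):
    """Number of bit positions in which two fingerprints differ."""
    return bin(x ^ y).count("1")
```

Under this sketch, two documents would be flagged as near duplicates when their Jaccard similarity exceeds a chosen threshold, or when the Hamming distance between their simhash fingerprints falls below a chosen number of bits. Both thresholds are tuning parameters that would need to be selected empirically.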

C. Lee Giles | Kyle Williams

[1] Ophir Frieder, et al. Collection statistics for fast duplicate document detection, 2002, TOIS.

[2] Monika Henzinger, et al. Finding near-duplicate web pages: a large-scale evaluation of algorithms, 2006, SIGIR.

[3] Gurmeet Singh Manku, et al. Detecting near-duplicates for web crawling, 2007, WWW '07.

[4] Moses Charikar, et al. Similarity estimation techniques from rounding algorithms, 2002, STOC '02.

[5] Susan Gauch, et al. Document similarity based on concept tree distance, 2008, Hypertext.

[6] J. A. Chandulal, et al. Signature Based Duplication Detection in Digital Libraries, 2006.

[7] Geoffrey Zweig, et al. Syntactic Clustering of the Web, 1997, Comput. Networks.

[8] R. Manmatha, et al. Partial duplicate detection for large book collections, 2011, CIKM '11.