Paper Information - Organizing News Archives by Near-Duplicate Copy Detection in Digital Libraries

Digital libraries hold huge numbers of documents, and organizing them so that people can easily browse and reference them is a challenging task. Existing classification methods and chronological or geographical ordering provide only partial views of news articles, so the relationships among articles are not easily grasped. In this paper, we propose a near-duplicate copy detection approach to organizing news archives in digital libraries. Conventional copy detection methods use word-level features, which can be time-consuming to compute and are not robust to term substitutions. We instead detect near-duplicate documents with a sentence-level, statistics-based approach that is language independent and simple but effective. It is orthogonal to word-based approaches and can be used to complement them. It is also insensitive to the actual page layout of articles. Experimental results show that the proposed approach detects near-duplicates in news archives with high efficiency and good accuracy.
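
The abstract does not say which sentence-level statistics are used, so the following is only a minimal sketch under stated assumptions, not the authors' method. It assumes the statistic is the number of non-whitespace characters in each sentence, and that two documents are compared by aligning their sequences of sentence lengths. The function names, the sentence-terminator pattern, and the 0.8 threshold are all hypothetical choices made for illustration.

```python
import re
from difflib import SequenceMatcher

# Assumption: sentences end at common Latin or CJK terminators.
# Abbreviations such as "U.S." would be split incorrectly by this pattern.
_SENTENCE_END = re.compile(r"[.!?。！？]+")


def sentence_lengths(text):
    """Return one non-whitespace character count per sentence."""
    lengths = []
    for sentence in _SENTENCE_END.split(text):
        # Dropping whitespace makes the count independent of line breaks
        # and spacing, i.e. of the page layout.
        n = len("".join(sentence.split()))
        if n > 0:
            lengths.append(n)
    return lengths


def signature_similarity(a, b):
    """Similarity in [0, 1] between the sentence-length sequences of a and b."""
    sa, sb = sentence_lengths(a), sentence_lengths(b)
    if not sa or not sb:
        return 0.0
    # SequenceMatcher aligns the two integer sequences; ratio() is
    # 2 * matches / (len(sa) + len(sb)).
    return SequenceMatcher(None, sa, sb, autojunk=False).ratio()


def is_near_duplicate(a, b, threshold=0.8):
    """Hypothetical decision rule: the threshold is an illustrative value."""
    return signature_similarity(a, b) >= threshold


original = ("Floods hit the city on Monday. Thousands were evacuated. "
            "Officials promised aid.")
reflowed = ("Floods hit the city\non Monday.   Thousands were\nevacuated. "
            "Officials promised aid. Read more online.")
unrelated = ("Stocks fell sharply. The central bank did not comment "
             "on the decline today.")

print(is_near_duplicate(original, reflowed))
print(is_near_duplicate(original, unrelated))
```

Because the sketch never tokenizes words, it needs no language-specific resources, and because it ignores whitespace, reflowing an article does not change its signature. Matching lengths exactly is a simplification: a substitution that changes a sentence's length breaks the match for that sentence, so a tolerance on the lengths would be needed in practice.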

Jenq-Haur Wang | Hung-Chi Chang
