Paper information - A fast document copy detection model


Text similarity measurement is a common problem in information retrieval, text mining, web mining, text classification and clustering, and document copy detection. The most popular approach is the word-frequency-based scheme, which represents a document as a word frequency vector. The cosine function, the dot product, and the proportion function are the usual similarity measures on such vectors, but they are symmetric and therefore cannot detect subset copies. In this paper we present the concepts of an asymmetric similarity model and the heavy frequency vector (HFV). The former detects subset copies well, and the latter saves considerable resources and CPU time. We have developed two new asymmetric measures: the heavy frequency vector (HFM) and the heavy inclusion proportion model (HIPM). HFM and HIPM are derived from the cosine function and the proportion function by combining the asymmetric similarity concept with the HFV. The HFV truncates the original full frequency vector to a short vector, and its parameter can be adjusted to balance the model's performance. The paper illustrates aspects of the asymmetric similarity and HFV models through several experiments.
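The difference between a symmetric and an asymmetric measure can be illustrated with a minimal Python sketch. It does not reproduce the paper's HFM or HIPM formulas, which the abstract does not give. The function names, the `inclusion` measure (the fraction of one document's word mass that is covered by the other), and the truncation rule (keep the `k` most frequent words) are all assumptions made for illustration only.

```python
from collections import Counter
import math


def frequency_vector(text):
    """Word frequency vector of a text, using naive whitespace tokenization."""
    return Counter(text.lower().split())


def heavy_vector(freq, k):
    """Truncate a frequency vector to its k most frequent words.

    Assumption: "heavy" is read here as "most frequent"; the abstract only
    says that the full vector is truncated to a short one.
    """
    return Counter(dict(freq.most_common(k)))


def cosine(a, b):
    """Symmetric cosine similarity: cosine(a, b) == cosine(b, a)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def inclusion(a, b):
    """Hypothetical asymmetric measure: share of a's word mass found in b."""
    total = sum(a.values())
    covered = sum(min(count, b[w]) for w, count in a.items())
    return covered / total if total else 0.0


if __name__ == "__main__":
    part = frequency_vector("the cat sat on the mat")
    whole = frequency_vector(
        "the cat sat on the mat while the dog slept by the door and the rain fell"
    )
    print(cosine(part, whole), cosine(whole, part))        # identical values
    print(inclusion(part, whole), inclusion(whole, part))  # differ
    print(heavy_vector(whole, 3))
```

In this example `part` is fully contained in `whole`, so `inclusion(part, whole)` is 1.0 while `inclusion(whole, part)` is much lower. The cosine value is the same in both directions and below 1, so it cannot tell which document is the subset.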

Xiao-Dong Liu | Jun-Yi Shen | Jun-Peng Bao | Hai-Yan Liu
