A Copy Detection Method for Chinese Text by Character Based N-gram

This paper studies the use of Chinese character-based n-grams for copy detection, using the Sogou Chinese News Corpus. Texts were compared with one another using variable-length n-grams (2<=n<=10) to find reports of the same news event written by different authors, which were treated as plagiarism. The experiments show that, unlike in English, higher-order n-grams (n = 4 or 5) significantly improve precision with no decline in recall in Chinese text copy detection.
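
As an illustration of the kind of comparison described above, the following is a minimal sketch of character-based n-gram matching for 2<=n<=10. The abstract does not state which similarity measure or decision threshold the paper uses, so the Jaccard coefficient here is an assumption, and the function names `char_ngrams`, `jaccard`, and `similarity_by_n` are hypothetical.

```python
def char_ngrams(text, n):
    """Return the set of character n-grams in text, ignoring whitespace."""
    chars = "".join(text.split())
    # Texts shorter than n yield an empty set.
    return {chars[i:i + n] for i in range(len(chars) - n + 1)}


def jaccard(a, b):
    """Jaccard coefficient of two sets (assumed measure, not from the paper)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similarity_by_n(doc_a, doc_b, n_values=range(2, 11)):
    """Compare two documents at every n-gram length in n_values (default 2..10)."""
    return {
        n: jaccard(char_ngrams(doc_a, n), char_ngrams(doc_b, n))
        for n in n_values
    }


if __name__ == "__main__":
    # Two short, made-up sentences that share some character sequences.
    scores = similarity_by_n("今天北京天气晴朗", "今天上海天气晴朗")
    for n, score in scores.items():
        print(n, round(score, 3))
```

Because no word segmentation is needed, the same routine applies directly to unsegmented Chinese text. Longer n-grams match only when longer character sequences are shared, which is one plausible reason the score at n = 4 or 5 would be more selective than at n = 2.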
