Multi Queries Methods of the Chinese-English Bilingual Plagiarism Detection

Cross-language plagiarism detection identifies and extracts plagiarized text in a multilingual environment. In recent years, a significant amount of work has addressed English and other European-language text, but less attention has been paid to Asian languages. We compare strategies for Chinese-English bilingual plagiarism detection, focusing on candidate document retrieval, and evaluate four query methods: (i) document keyword-based, (ii) intrinsic plagiarism-based, (iii) header-based, and (iv) machine translation-based queries. Our evaluation indicates that keyword-based queries, the simplest and most efficient approach, give acceptable results for newspaper articles. We also compare queries built from different percentages of a document's keywords; the results indicate that including 50% of the keywords in the query yields a satisfactory candidate document set.
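
As an illustration of the keyword-based query idea, the following minimal Python sketch builds a query from a fixed fraction of a document's ranked keywords. It is not the implementation evaluated here: the term-frequency ranking, the small stopword list, the assumption that the text is already tokenized, and the names `rank_keywords` and `build_query` are all hypothetical choices made for the example. It also assumes that "50% of the keywords" means the top half of the ranked distinct keywords.

```python
import math
from collections import Counter

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"the", "a", "of", "and", "in", "to", "is"}


def rank_keywords(tokens):
    """Rank distinct non-stopword tokens by frequency (an assumed scoring).

    Ties are broken alphabetically so the ranking is deterministic.
    """
    counts = Counter(t.lower() for t in tokens if t.lower() not in STOPWORDS)
    return sorted(counts, key=lambda t: (-counts[t], t))


def build_query(tokens, fraction=0.5):
    """Return the top `fraction` of the ranked keywords as query terms."""
    ranked = rank_keywords(tokens)
    # Round up so that a non-empty keyword list never yields an empty query.
    k = math.ceil(len(ranked) * fraction)
    return ranked[:k]


if __name__ == "__main__":
    text = "the plagiarism detection of plagiarism in news detection plagiarism text"
    print(build_query(text.split(), fraction=0.5))
```

Varying `fraction` corresponds to the comparison of different keyword percentages. In a bilingual setting the selected terms would still need to be translated or mapped into the target language before retrieval, a step this sketch omits.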