Copy detection in Chinese documents using the Ferret: a report on experiments

The Ferret copy detector has been used for some years on English texts to find plagiarism in large collections of students’ coursework. This article reports on extending its application to Chinese, which differs from English in several respects: the characters that make up a Chinese text are not separated by marked word boundaries; the Chinese “alphabet”, that is, the set of distinct characters, is vast; and characters are represented with multi-byte encodings. We discuss issues of representation, focus on the effectiveness of a sub-symbolic approach, and show how Ferret can circumvent the classic problem of finding word boundaries with an automated system. We collected corpora of students’ coursework from two Chinese universities and applied Ferret to them to investigate the detection of plagiarism. Our experiments show that Ferret can find both artificially constructed plagiarism and actually occurring, previously undetected plagiarism. We also investigate the parameters of the system and report on typical optimum settings. Overall, the experiments show that Ferret can work well on Chinese texts and achieve consistent performance. The investigation into the representation of written Chinese is likely to be of use in other language processing tasks.
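To make the idea of sidestepping word boundaries concrete, the following is a minimal sketch, not Ferret's actual implementation. It assumes, as an illustration only, that each text is reduced to the set of its overlapping character trigrams and that two texts are compared with a Jaccard-style resemblance score; the abstract itself does not specify these details. The function names `char_ngrams` and `resemblance` and the default `n=3` are hypothetical.

```python
def char_ngrams(text, n=3):
    """Return the set of overlapping character n-grams in text.

    Whitespace is dropped first. No word segmentation is attempted:
    the n-grams are taken directly over the character sequence.
    """
    chars = [c for c in text if not c.isspace()]
    return {"".join(chars[i:i + n]) for i in range(len(chars) - n + 1)}


def resemblance(text_a, text_b, n=3):
    """Jaccard-style score: shared n-grams divided by all distinct n-grams."""
    grams_a = char_ngrams(text_a, n)
    grams_b = char_ngrams(text_b, n)
    union = grams_a | grams_b
    if not union:
        # Both texts are shorter than n characters; treat as no evidence.
        return 0.0
    return len(grams_a & grams_b) / len(union)


if __name__ == "__main__":
    # Two short sentences that share their first five characters.
    print(resemblance("我们今天去学校", "我们今天去公园"))
```

Because Python 3 strings are sequences of Unicode code points, each Chinese character counts as one unit here regardless of how many bytes it occupies in an encoding such as UTF-8 or GB2312. A score near 1 would suggest heavy overlap and a score near 0 little or none; any threshold for flagging a pair would have to be chosen empirically.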