Selective chunking — Easy and effective way to estimate text similarity

Plagiarism is a serious problem, especially in the academic environment. We define it as the theft of somebody else's work or ideas. In this paper we focus on plagiarism in student assignments written in natural language. We propose an approach intended to identify copied fragments of text faster and more accurately than standard approaches. We first identify topic-related pairs of text documents and then pass only those pairs that discuss a similar topic on to further processing. In the comparison step we experimented with different chunking methods to overcome typical problems, such as short fragments of text copied from other documents. The results show that our approach is more suitable for plagiarism detection than a standard n-gram method.
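
To make the two-stage idea concrete, the following is a minimal sketch and not the implementation evaluated in this paper. It assumes bag-of-words cosine similarity as the topic filter and overlapping word n-grams as the chunks, which corresponds to the standard n-gram baseline rather than to any particular selective chunking method. The function names, the threshold of 0.3 and the chunk size of 3 are hypothetical choices made only for illustration.

```python
import math
import re
from collections import Counter
from itertools import combinations


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity of two bag-of-words term-frequency vectors."""
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(freq_a[t] * freq_b[t] for t in freq_a.keys() & freq_b.keys())
    norm_a = math.sqrt(sum(c * c for c in freq_a.values()))
    norm_b = math.sqrt(sum(c * c for c in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def word_ngrams(tokens, n):
    """Set of overlapping word n-grams (the chunks being compared)."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def chunk_overlap(tokens_a, tokens_b, n=3):
    """Share of the smaller chunk set that also occurs in the other document."""
    chunks_a, chunks_b = word_ngrams(tokens_a, n), word_ngrams(tokens_b, n)
    if not chunks_a or not chunks_b:
        return 0.0
    return len(chunks_a & chunks_b) / min(len(chunks_a), len(chunks_b))


def find_suspicious_pairs(documents, topic_threshold=0.3, n=3):
    """Stage 1 keeps topic-related pairs; stage 2 scores them by chunk overlap.

    The threshold and chunk size are illustrative assumptions, not tuned values.
    """
    tokens = {name: tokenize(text) for name, text in documents.items()}
    results = []
    for a, b in combinations(sorted(tokens), 2):
        if cosine_similarity(tokens[a], tokens[b]) < topic_threshold:
            continue  # unrelated topics: skip the costlier chunk comparison
        results.append((a, b, chunk_overlap(tokens[a], tokens[b], n)))
    # Highest overlap first, so the most suspicious pairs are reviewed first.
    return sorted(results, key=lambda r: r[2], reverse=True)


docs = {
    "a": "the quick brown fox jumps over the lazy dog near the river bank",
    "b": "yesterday the quick brown fox jumps over the lazy dog in the park",
    "c": "stock markets fell sharply after central banks raised interest rates",
}
pairs = find_suspicious_pairs(docs)
```

In this sketch only the pair ("a", "b") passes the topic filter, so the chunk comparison runs once instead of three times. The saving comes from the cheap first stage discarding unrelated pairs; swapping `word_ngrams` for a different chunking function is the point at which alternative chunking methods would plug in.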