The Influence of Text Pre-processing on Plagiarism Detection

This paper explores the influence of text pre-processing techniques on plagiarism detection. We examine stop-word removal, lemmatization, number replacement, synonymy recognition, and word generalization. We also investigate the influence of punctuation and of word order within N-grams. All of these techniques are evaluated according to their impact on the F1-measure and on execution speed. Our experiments were performed on a Czech corpus of plagiarized documents about politics. We conclude by proposing what we consider to be the best combination of text pre-processing techniques.
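To make some of the techniques named above concrete, the following is a minimal Python sketch of stop-word removal, number replacement, optional punctuation handling, and order-insensitive N-grams, together with the standard F1 formula. It is an illustration under stated assumptions, not the implementation evaluated in this paper: the function names, the `<num>` placeholder, and the small English stop-word list are hypothetical, and lemmatization, synonymy recognition, and word generalization are omitted because they require language-specific resources.

```python
import re

# Hypothetical English stop-word list, for illustration only.
# The corpus used in the paper is Czech and would need its own list.
STOP_WORDS = {"the", "a", "of", "in", "and", "was", "by"}


def preprocess(text, remove_stop_words=True, replace_numbers=True,
               keep_punctuation=False):
    """Lowercase and tokenize text, then apply the selected techniques."""
    # Either keep punctuation marks as separate tokens or drop them.
    pattern = r"\w+|[^\w\s]" if keep_punctuation else r"\w+"
    tokens = re.findall(pattern, text.lower())
    if replace_numbers:
        # Map every purely numeric token to one placeholder (assumed name).
        tokens = ["<num>" if t.isdigit() else t for t in tokens]
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens


def ngrams(tokens, n=3, ignore_order=False):
    """Return the set of word N-grams; optionally ignore word order."""
    grams = set()
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        if ignore_order:
            # Sorting makes N-grams with the same words compare equal.
            gram = tuple(sorted(gram))
        grams.add(gram)
    return grams


def f1_score(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall."""
    if true_positives == 0:
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


source = preprocess("The budget of 2010 was approved by parliament.")
suspect = preprocess("Parliament approved the budget of 2011.")

# Shared trigrams with and without regard to word order.
ordered_overlap = ngrams(source) & ngrams(suspect)
unordered_overlap = (ngrams(source, ignore_order=True)
                     & ngrams(suspect, ignore_order=True))
```

In this toy pair, the reordered sentence shares no trigram with the source when word order is respected, but shares one when the words inside each trigram are sorted. This is the kind of effect that the comparison of word order within N-grams is meant to capture.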