Paper information - Selective chunking — Easy and effective way to estimate text similarity

Selective chunking — Easy and effective way to estimate text similarity

Plagiarism is a serious problem, especially in the academic environment. We define it as the theft of somebody else's work or ideas. In this paper we focus on plagiarism in student assignments written in natural language. We propose an approach intended to identify copied text fragments faster and more accurately than standard approaches. We first identify topic-related pairs of text documents and then select for further processing only those pairs that discuss a similar topic. We experimented with different chunking methods in the comparison process to overcome typical problems, such as short fragments of text copied from other documents. The results show that our approach is more suitable for plagiarism detection than a standard n-gram method.
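The two-stage idea in the abstract (filter document pairs by topic, then compare only the surviving pairs in detail) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the topic measure or the chunking methods, so the sketch assumes cosine similarity over raw term frequencies for the topic filter and uses word n-gram Jaccard overlap, roughly the kind of standard n-gram comparison the paper takes as its baseline, for the detailed step. All function names, the tokenizer, and the `topic_threshold` value are hypothetical.

```python
import math
import re
from collections import Counter
from itertools import combinations


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity of raw term-frequency vectors (assumed topic measure)."""
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(freq_a[t] * freq_b[t] for t in freq_a.keys() & freq_b.keys())
    norm_a = math.sqrt(sum(c * c for c in freq_a.values()))
    norm_b = math.sqrt(sum(c * c for c in freq_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def word_ngrams(tokens, n):
    """Set of all contiguous word n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_overlap(tokens_a, tokens_b, n=3):
    """Jaccard overlap of word n-gram sets (a plain n-gram comparison)."""
    grams_a, grams_b = word_ngrams(tokens_a, n), word_ngrams(tokens_b, n)
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0


def find_suspicious_pairs(documents, topic_threshold=0.3, n=3):
    """Stage 1: keep topic-related pairs. Stage 2: score them by n-gram overlap.

    `documents` maps a document name to its text. The threshold is an
    illustrative value, not one taken from the paper.
    """
    tokens = {name: tokenize(text) for name, text in documents.items()}
    results = []
    for a, b in combinations(sorted(tokens), 2):
        if cosine_similarity(tokens[a], tokens[b]) < topic_threshold:
            continue  # different topics: skip the more expensive comparison
        results.append((a, b, ngram_overlap(tokens[a], tokens[b], n)))
    # Highest overlap first, so the most suspicious pairs come at the top.
    return sorted(results, key=lambda r: r[2], reverse=True)
```

Calling `find_suspicious_pairs` with a dictionary of assignment texts returns `(name_a, name_b, overlap)` tuples for the topic-related pairs only. The point of the filter is that the detailed comparison runs on a fraction of all pairs; a chunk-based comparison such as the one the paper studies would replace `ngram_overlap` in the second stage.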

Daniela Chuda | Tomas Kucecka | Patrik Samuhel
