Plagiarism detection in text using Vector Space Model

Plagiarism is the act of copying someone else's ideas or work and claiming them as one's own. Plagiarism detection is the task of identifying the passages of a given document that are plagiarized, i.e. copied from other documents. The main challenges arise because plagiarists often obfuscate the copied text: they may shuffle, remove, insert, or replace words or short phrases, restructure sentences, substitute synonyms, and change the order in which words appear in a sentence. In this paper we propose a technique based on textual similarity for external plagiarism detection. Given a suspicious document, the task is to identify the set of source documents from which it was copied. The proposed method comprises four phases. In the first phase, we process all documents to generate tokens, lemmas, Part-of-Speech (PoS) classes, character offsets, sentence numbers, and named-entity (NE) classes. In the second phase, we select a subset of documents that may be sources of plagiarism, using an approach based on the traditional Vector Space Model (VSM) for this candidate selection. In the third phase, we use a graph-based approach to find similar passages in the suspicious document and the selected source documents. Finally, we filter out false detections.
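The candidate selection of the second phase can be illustrated with a minimal sketch of VSM-based retrieval. The abstract does not specify the term weighting or the selection criterion, so the smoothed TF-IDF weights, the cosine similarity measure, and the threshold of 0.2 below are assumptions made for illustration only, as are all function names. The simple regular-expression tokenizer stands in for the tokens and lemmas produced in the first phase.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters.

    A stand-in for the tokenization/lemmatization of the first phase.
    """
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(token_lists):
    """Build one sparse TF-IDF vector (term -> weight dict) per token list."""
    n_docs = len(token_lists)
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))  # count each term once per document
    # Smoothed IDF (an assumed weighting): terms found in every document
    # still keep a small positive weight.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return [
        {t: tf * idf[t] for t, tf in Counter(tokens).items()}
        for tokens in token_lists
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_candidates(suspicious, sources, threshold=0.2):
    """Return (source_id, similarity) pairs at or above the threshold.

    `sources` maps a document id to its text. The threshold is a
    hypothetical value, not one taken from the paper.
    """
    ids = list(sources)
    vectors = tfidf_vectors(
        [tokenize(suspicious)] + [tokenize(sources[i]) for i in ids]
    )
    susp_vec, src_vecs = vectors[0], vectors[1:]
    scored = [(i, cosine(susp_vec, vec)) for i, vec in zip(ids, src_vecs)]
    kept = [pair for pair in scored if pair[1] >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    sources = {
        "src-a": "the quick brown fox jumps over the lazy dog",
        "src-b": "stock markets fell sharply on monday",
    }
    suspicious = "a quick brown fox jumped over a lazy dog"
    for doc_id, score in select_candidates(suspicious, sources):
        print(doc_id, round(score, 3))
```

Only the documents returned by `select_candidates` would be passed on to the more expensive passage-level comparison of the third phase, which is the reason for placing a cheap document-level filter first.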
