Plagiarism detection using stopword n-grams

This paper presents a novel method for detecting plagiarized passages in document collections. In contrast to previous work in this field, which represents documents by their content terms, the proposed method is based on a small list of stopwords (i.e., very frequent words). We show that stopword n-grams reveal important information for plagiarism detection: they capture syntactic similarities between suspicious and original documents, and they can be used to detect the exact boundaries of plagiarized passages. Experimental results on a publicly available corpus demonstrate that the performance of the proposed approach is competitive with the best reported results. More importantly, it achieves significantly better results on difficult plagiarism cases, where the plagiarized passages are highly modified and most of the words or phrases have been replaced with synonyms. © 2011 Wiley Periodicals, Inc.
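
To make the idea of stopword n-grams concrete, the following is a minimal sketch and not the paper's implementation. The stopword list, the n-gram length of 3, the tokenizer, and the function names are all assumptions chosen for illustration; the abstract does not specify them. The sketch reduces each text to its sequence of stopwords, builds n-grams over that sequence, and intersects the two sets, so that a passage whose content words were swapped for synonyms can still match on its stopword skeleton.

```python
import re

# Hypothetical stopword list, for illustration only; the abstract says
# only that the method uses "a small list of stopwords".
STOPWORDS = frozenset({
    "the", "of", "and", "a", "in", "to", "is", "was",
    "it", "for", "with", "that", "by", "on", "as",
})


def stopword_sequence(text):
    """Return the stopwords of `text` in order, dropping all other tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t in STOPWORDS]


def stopword_ngrams(text, n=3):
    """Return the set of n-grams over the stopword sequence of `text`."""
    seq = stopword_sequence(text)
    return {tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)}


def shared_ngrams(suspicious, original, n=3):
    """Return the stopword n-grams that occur in both texts."""
    return stopword_ngrams(suspicious, n) & stopword_ngrams(original, n)


original = ("The results of the study were presented to the committee "
            "in the morning.")
# Same sentence structure, with the content words replaced.
suspicious = ("The findings of the investigation were shown to the panel "
              "in the afternoon.")
unrelated = "It was a pleasure for me."

print(stopword_sequence(original))
print(sorted(shared_ngrams(suspicious, original)))
print(sorted(shared_ngrams(unrelated, original)))
```

In this toy example the two paraphrased sentences share no content words, yet their stopword sequences coincide, while the unrelated sentence shares no stopword 3-gram with the original. Locating exact passage boundaries, as the abstract describes, would require tracking the positions of matching n-grams, which this sketch leaves out.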
