Detecting Text Reuse with Modified and Weighted N-grams

Text reuse is common in many scenarios, and documents are often based, at least in part, on existing documents. This paper reports an approach to detecting text reuse that identifies not only documents reused verbatim but also cases in which the original has been rewritten. The approach detects reuse by comparing word n-grams across documents and modifies these n-grams (by substituting words with synonyms and deleting words) to detect text that has been altered. Applied to a corpus of newspaper stories, the approach is found to outperform a previously reported method.
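The idea of comparing word n-grams, and of modifying them by deletion and synonym substitution, can be illustrated with a minimal sketch. It is not the paper's implementation. Several things are assumptions made for illustration: the containment-style score, the choice to modify the source document's n-grams rather than the derived document's, the restriction to one deletion and one substitution per n-gram, and the toy `SYNONYMS` table (a real system would presumably draw on a lexical resource). The n-gram weighting mentioned in the title is not covered, since the abstract does not describe it.

```python
# Hypothetical toy synonym table, for illustration only.
SYNONYMS = {
    "big": {"large"}, "large": {"big"},
    "said": {"stated"}, "stated": {"said"},
}

def ngrams(tokens, n):
    """Return the set of word n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def deletion_variants(tokens, n):
    """n-grams formed by dropping one interior word from each (n+1)-gram."""
    variants = set()
    for gram in ngrams(tokens, n + 1):
        for i in range(1, n):  # interior positions of an (n+1)-gram
            variants.add(gram[:i] + gram[i + 1:])
    return variants

def synonym_variants(gram):
    """n-grams formed by replacing exactly one word with a listed synonym."""
    variants = set()
    for i, word in enumerate(gram):
        for syn in SYNONYMS.get(word, ()):
            variants.add(gram[:i] + (syn,) + gram[i + 1:])
    return variants

def containment(derived, source, n=3, modified=False):
    """Fraction of the derived text's n-grams that are found in the source."""
    d = derived.lower().split()
    s = source.lower().split()
    derived_grams = ngrams(d, n)
    if not derived_grams:
        return 0.0
    source_grams = ngrams(s, n)
    if modified:
        # Assumption: expand the source side with deletion variants,
        # then apply single-word synonym substitution to every candidate.
        candidates = source_grams | deletion_variants(s, n)
        source_grams = set(candidates)
        for gram in candidates:
            source_grams |= synonym_variants(gram)
    hits = sum(1 for gram in derived_grams if gram in source_grams)
    return hits / len(derived_grams)

source = "the minister said the large budget deficit would fall"
derived = "the minister stated the big deficit would fall"
exact_score = containment(derived, source)
modified_score = containment(derived, source, modified=True)
```

On this rewritten sentence pair, the modified comparison scores higher than the exact one, because n-grams that differ from the source by a deleted word or a substituted synonym now count as matches. N-grams that differ by two substitutions still go unmatched in this sketch.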
