Analysis and extraction of sentence-level paraphrase sub-corpus in CS education

Since the advent of the Internet, plagiarism has become a widespread problem in student submissions. Paraphrasing is one of several types of plagiarism that students use to mask the original source. In this work, we construct a sub-corpus of paraphrased sentences by extracting all lightly and heavily revised sentences from the Corpus of Plagiarized Short Answers, using modified criteria for sentences. We then apply document similarity measures to this sub-corpus and derive several of its features. Our findings suggest that this sub-corpus is better suited for testing paraphrase detection techniques than the original corpus, because it provides sentence-level paraphrasing samples instead of the file-level classification of the original. Additional sentence samples may also be added to this sub-corpus to increase its variety and scale.
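To make the idea of applying a document similarity measure at the sentence level concrete, the following is a minimal sketch only. The abstract does not say which measures were used; this example assumes two common set-based measures, resemblance (Jaccard overlap) and containment, computed over word n-gram shingles. The function names, the tokenization rule, and the choice of n = 2 are all illustrative assumptions, not the authors' method.

```python
import re


def shingles(text, n=2):
    """Return the set of word n-grams (shingles) in a sentence.

    Assumption: tokens are lowercase alphanumeric runs; the paper's own
    tokenization is not specified in the abstract.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if len(tokens) < n:
        # Sentences shorter than n tokens yield one short shingle (or none).
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def resemblance(a, b, n=2):
    """Jaccard overlap of the two shingle sets: |A & B| / |A | B|."""
    sa, sb = shingles(a, n), shingles(b, n)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def containment(a, b, n=2):
    """Fraction of a's shingles that also occur in b: |A & B| / |A|."""
    sa, sb = shingles(a, n), shingles(b, n)
    return len(sa & sb) / len(sa) if sa else 0.0


if __name__ == "__main__":
    # Hypothetical sentence pair, not taken from the corpus.
    source = "Inheritance lets a class reuse the behaviour of another class."
    revised = "With inheritance, a class can reuse the behaviour of another class."
    print(resemblance(source, revised))
    print(containment(source, revised))
```

Under these assumptions, a lightly revised sentence would tend to score higher than a heavily revised one, since light revision preserves more of the source's word sequences. Containment is asymmetric, which can be useful when one sentence is much shorter than the other.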
