Old and new challenges in automatic plagiarism detection

Automatic methods of measuring similarity between pairs of programs and between pairs of natural language texts have been used for many years to assist humans in detecting plagiarism. For example, over the past thirty years or so, a vast number of approaches have been proposed for detecting likely plagiarism between programs written by Computer Science students. More recently, approaches to identifying similarities between natural language texts have also been proposed, but the ambiguity and complexity of natural languages compared with programming languages make this task considerably harder. Automatic detection is attracting growing interest from both the academic and commercial worlds, given the ease with which texts can now be found, copied and rewritten. Following the recent increase in the popularity of online plagiarism detection services and the increased publicity surrounding cases of plagiarism in academia and industry, this paper explores the nature of the plagiarism problem and, in particular, summarises the approaches used so far for its detection. I focus on plagiarism detection in natural language and discuss a number of methods I have used to measure text reuse. I end with a number of recommendations for further work in the field of automatic plagiarism detection.
