Plagiarism Detection in Source Programs Using Structural Similarities

This paper presents a plagiarism detection framework that determines whether two programs are similar to each other and, if so, to what extent. Plagiarism detection has been studied earlier for written material such as student essays, and text-based algorithms have been published for that purpose. We argue that for comparing program code, structure-based techniques may be much more suitable. The main idea is to transform the source code into mathematical objects, apply appropriate reduction and comparison methods to these objects, and interpret the results. We have designed a generic framework for comparing program structure and implemented it for the Prolog and SML programming languages. The implementation has been used at BUTE for years to successfully detect plagiarism in homework assignments.
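The transform, reduce, and compare idea can be illustrated with a minimal sketch. It is not the framework described above: it assumes Python programs rather than Prolog or SML, uses the standard `ast` module as a stand-in for the structural transformation, and uses `difflib.SequenceMatcher` as a stand-in for the comparison step. The function names `structure_signature` and `structural_similarity` are hypothetical.

```python
import ast
from difflib import SequenceMatcher


def structure_signature(source: str) -> list:
    """Reduce a program to the sequence of its AST node type names.

    Identifiers, literal values, comments, and layout are discarded, so
    renaming variables or reformatting does not change the signature.
    (Illustrative reduction only; not the reduction used in the paper.)
    """
    tree = ast.parse(source)
    return [type(node).__name__ for node in ast.walk(tree)]


def structural_similarity(source_a: str, source_b: str) -> float:
    """Return a similarity ratio in [0, 1] between two program structures."""
    sig_a = structure_signature(source_a)
    sig_b = structure_signature(source_b)
    # autojunk=False keeps frequent node types from being ignored.
    return SequenceMatcher(None, sig_a, sig_b, autojunk=False).ratio()


original = (
    "def total(xs):\n"
    "    s = 0\n"
    "    for x in xs:\n"
    "        s += x\n"
    "    return s\n"
)
# Same structure, every identifier renamed.
renamed = (
    "def sum_all(values):\n"
    "    acc = 0\n"
    "    for v in values:\n"
    "        acc += v\n"
    "    return acc\n"
)
# Structurally unrelated program.
different = "def greet(name):\n    print('hello', name)\n"

print(structural_similarity(original, renamed))
print(structural_similarity(original, different))
```

A purely text-based comparison would see the renamed program as largely different from the original, whereas the structural signature treats the two as identical. This is the kind of disguise that structure-based comparison is meant to see through.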
