Efficient plagiarism detection for large code repositories

Unauthorized re-use of code is a widespread problem among students in academic institutions, and it raises liability issues for industry. Manual plagiarism detection is time-consuming, and the effective detection approaches currently available do not scale easily to very large code repositories. Practical text-based plagiarism detection systems can work with large collections, but no comparable systems exist for code. In this paper, we propose techniques for detecting plagiarism in program code using text similarity measures and local alignment. Through detailed empirical evaluation on small and large collections of programs, we show that our approach is highly scalable while remaining about as effective as the popular JPlag and MOSS systems. Copyright © 2006 John Wiley & Sons, Ltd.
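The abstract does not spell out the alignment procedure, so the following is only a minimal sketch of how local alignment can score the similarity of two programs, not the authors' implementation. It assumes a crude regular-expression tokenizer, a small hard-coded keyword set, and a textbook Smith-Waterman recurrence; the function names (`tokenize`, `normalize`, `local_alignment_score`) and the scoring parameters (`match`, `mismatch`, `gap`) are hypothetical choices made for illustration.

```python
import re

# Assumed keyword set for illustration only; a real system would use the
# full keyword list of the language being analysed.
KEYWORDS = {"for", "while", "if", "else", "return", "int"}


def tokenize(source):
    """Split source text into identifiers, numbers, and single symbols."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", source)


def normalize(tokens):
    """Replace every non-keyword identifier with the placeholder 'ID',
    so that renaming variables does not hide a match."""
    result = []
    for tok in tokens:
        is_identifier = tok[0].isalpha() or tok[0] == "_"
        result.append("ID" if is_identifier and tok not in KEYWORDS else tok)
    return result


def local_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the best Smith-Waterman local alignment score of a and b.

    Only two rows of the dynamic-programming table are kept, so memory
    use is proportional to len(b).
    """
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


original = "int total = 0; for (i = 0; i < n; i++) total += x[i];"
renamed = "int sum = 0; for (k = 0; k < m; k++) sum += y[k];"

a = normalize(tokenize(original))
b = normalize(tokenize(renamed))
score = local_alignment_score(a, b)
print(score, len(a))
```

In this sketch the two fragments differ only in variable names, so their normalized token sequences are identical and the score equals the sequence length. Comparing every pair of programs this way costs time proportional to the product of their lengths, which is why a scalable system would first need a cheaper step to narrow down the candidate pairs.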
