Plagiarism Detection in Computer Programming Using Feature Extraction From Ultra-Fine-Grained Repositories

Detecting plagiarism in student homework, especially programming homework, is an important problem for practitioners. Over the past decades, several tools have emerged that can effectively compare large corpora of homework submissions and rank pairs by degree of similarity. However, those tools are available to students as well, allowing them to experiment and develop elaborate methods for evading detection. Such tools are also unable to detect “external plagiarism”, in which students obtain unethical help from sources other than students in the same course. One way to address this problem is to monitor student activity while they solve their homework in a cloud-based integrated development environment (IDE) and to detect suspicious behaviours. Each editing event in the program source can be stored as a new commit, creating a form of ultra-fine-grained source code repository. In this paper, the authors propose several new features that can be extracted from such repositories in order to build a comprehensive profile of each individual developer. Machine learning techniques were used to detect suspicious behaviours, which allowed the authors to significantly improve upon the performance of more traditional plagiarism detection tools.
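To make the idea of extracting features from per-edit commits more concrete, the following is a minimal sketch in Python. It is not the authors' feature set, which the abstract does not enumerate. The `EditEvent` record, the `extract_features` function, the feature names, and both thresholds are hypothetical choices made only for illustration. The sketch assumes that each commit records a timestamp and the number of characters inserted and deleted.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class EditEvent:
    """One commit in a hypothetical ultra-fine-grained repository."""
    timestamp: float  # seconds since the start of the session
    inserted: int     # characters added by this edit
    deleted: int      # characters removed by this edit


def extract_features(events, paste_threshold=200, pause_threshold=300.0):
    """Summarise one student's edit history as a small feature dictionary.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    if not events:
        return {"n_events": 0, "total_inserted": 0, "paste_ratio": 0.0,
                "long_pauses": 0, "median_insert": 0}

    events = sorted(events, key=lambda e: e.timestamp)
    total_inserted = sum(e.inserted for e in events)
    # Characters that arrived in a single large edit, a possible sign of pasting.
    pasted = sum(e.inserted for e in events if e.inserted >= paste_threshold)
    # Time elapsed between consecutive edits.
    gaps = [b.timestamp - a.timestamp for a, b in zip(events, events[1:])]

    return {
        "n_events": len(events),
        "total_inserted": total_inserted,
        "paste_ratio": pasted / total_inserted if total_inserted else 0.0,
        "long_pauses": sum(1 for g in gaps if g > pause_threshold),
        "median_insert": median(e.inserted for e in events),
    }


if __name__ == "__main__":
    history = [
        EditEvent(timestamp=0.0, inserted=5, deleted=0),
        EditEvent(timestamp=2.0, inserted=3, deleted=1),
        EditEvent(timestamp=400.0, inserted=600, deleted=0),  # large edit after a long pause
    ]
    print(extract_features(history))
```

A feature dictionary of this kind, computed per student and per assignment, could then be passed to a classifier of one's choosing. A high `paste_ratio` combined with long pauses is the sort of pattern such a model might learn to flag, though whether it indicates plagiarism would have to be established empirically.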
