A Hybrid Method for Detecting Source-code Plagiarism in Computer Programming Courses

This paper presents a hybrid method for detecting source-code plagiarism in computer programming courses. In many programming courses, students submit their assignments as electronic source files, and it is difficult for the teacher to detect plagiarism among the assignments manually. Our system compares two source files automatically to help solve this problem. The system works as follows. First, the source files are preprocessed to filter out noise elements such as header file include statements, comments, input/output statements, and string literals. Second, a feature-based detection component is proposed. For each source file, a feature vector is generated that includes physical metrics, such as the number of source lines and the total number of words, and Halstead metrics, such as statistics on source code operators, execution flows, and operands. The distance between two feature vectors is then computed and used as the measure of similarity between the corresponding source files. Third, a structure-based detection component is proposed. For each source file, the source code is transformed into a sequence of well-defined tokens. To improve computational efficiency, each token is then mapped to a single character using a mapping table, so that each file is represented as a string. The LCS (Longest Common Subsequence) is computed for each pair of strings and used as the measure of similarity between the corresponding source files. Finally, an integration component is proposed that uses a two-stage strategy to combine the two components into one complete system. Experimental results show that our system can effectively spot suspected program copies, even when they contain minor modifications.
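
The two detection components can be illustrated with a minimal Python sketch. It is not the system's implementation, and several details are assumptions made for illustration only: the mapping table `TOKEN_CODES`, the tokenizer pattern `TOKEN_RE`, the normalization 2·LCS/(|a|+|b|) used in `structural_similarity`, the choice of Euclidean distance in `feature_distance`, and the restriction of `feature_vector` to two physical metrics (Halstead metrics are omitted).

```python
import math
import re

# Hypothetical mapping table: token -> single character (an assumption,
# the abstract does not give the actual table).
TOKEN_CODES = {
    "if": "I", "else": "E", "for": "F", "while": "W", "return": "R",
    "int": "T", "==": "q", "!=": "x", "<=": "l", ">=": "g",
}
# Two-character operators are listed first so "==" is not split into "=" "=".
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|==|!=|<=|>=|[-+*/%=<>(){};,]")


def encode(source):
    """Map each token to one character; identifiers -> 'v', numbers -> 'n'."""
    chars = []
    for tok in TOKEN_RE.findall(source):
        if tok in TOKEN_CODES:
            chars.append(TOKEN_CODES[tok])
        elif tok[0].isalpha() or tok[0] == "_":
            chars.append("v")
        elif tok[0].isdigit():
            chars.append("n")
        else:
            chars.append(tok)  # single-character operator or punctuation
    return "".join(chars)


def lcs_length(a, b):
    """Length of the longest common subsequence, using O(len(b)) memory."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        curr = [0]
        for j, other in enumerate(b, 1):
            if ch == other:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def structural_similarity(src_a, src_b):
    """Assumed normalization: 2 * LCS / (len(a) + len(b)), in [0, 1]."""
    a, b = encode(src_a), encode(src_b)
    if not a and not b:
        return 1.0
    return 2 * lcs_length(a, b) / (len(a) + len(b))


def feature_vector(source):
    """Two physical metrics only: non-blank line count and word count."""
    lines = [ln for ln in source.splitlines() if ln.strip()]
    return (len(lines), len(source.split()))


def feature_distance(u, v):
    """Euclidean distance between feature vectors (an assumed metric)."""
    return math.dist(u, v)
```

Because every identifier is mapped to the same character in this sketch, renaming variables leaves the encoded string unchanged, so `structural_similarity` returns 1.0 for two files that differ only in variable names.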