Differential Weight Based Hybrid Approach to Detect Software Plagiarism

In this paper we propose three representations of source code, each intended to highlight a different aspect of the code: (i) lexical, (ii) structural, and (iii) stylistic. For the lexical view, we use the Levenshtein distance, computed after excluding the reserved words of the programming language. For the structural view, we propose a similarity metric that takes into account the function signatures and variable declarations within the source code. The stylistic view consists of several features, such as the number of white spaces, lines of code, and upper-case letters. Finally, we combine these representations in several ways. The results obtained indicate that the proposed representations provide information that makes it possible to detect particular cases of source code re-use.
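
As an illustration only, the lexical and stylistic views could be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not the implementation evaluated in the paper: the reserved-word set `RESERVED`, the tokenisation rule, the normalisation by the longer string, and all function names are hypothetical choices made for the example.

```python
import re

# Hypothetical reserved-word list, for illustration only; the real list
# depends on the programming language being analysed.
RESERVED = {"int", "float", "void", "return", "if", "else", "for", "while"}


def strip_reserved(code: str) -> str:
    """Tokenise the code, drop reserved words, and rejoin with single spaces."""
    tokens = re.findall(r"[A-Za-z_]\w*|\S", code)
    return " ".join(t for t in tokens if t not in RESERVED)


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def lexical_similarity(code_a: str, code_b: str) -> float:
    """Similarity in [0, 1]; normalising by the longer string is an assumption."""
    a, b = strip_reserved(code_a), strip_reserved(code_b)
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def stylistic_features(code: str) -> dict:
    """A few of the stylistic counts named in the abstract."""
    return {
        "white_spaces": sum(ch.isspace() for ch in code),
        "lines_of_code": len(code.splitlines()),
        "upper_case_letters": sum(ch.isupper() for ch in code),
    }
```

For example, `lexical_similarity("int a = 1;", "int b = 1;")` compares `a = 1 ;` with `b = 1 ;` once the reserved word `int` is dropped, so a single renamed variable costs only one edit. The structural view and the combination step are omitted here because the abstract does not specify them in enough detail to sketch.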