A Diversified Feature Extraction Approach for Program Similarity Analysis

As code plagiarism becomes increasingly prevalent, the need for code similarity detection technology is growing rapidly. Program features are the basic units that represent a program's procedure and structure, so the quality of the features directly affects the accuracy of similarity detection results. In this paper, we propose a diversified feature extraction approach that extracts feature information from attribute counting, statement structure, program structure, and program function. During feature extraction, we comprehensively consider multiple aspects of a program, such as its structure, semantics, and data flow. Evaluation results show that this approach can eliminate the interference caused by multiple plagiarism methods and that it offers some improvement in both accuracy and detection efficiency.
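
To make the first feature category concrete, the following is a minimal sketch of attribute counting only. It is not the implementation evaluated in this paper. It assumes Python source as input and uses Python's standard `ast` module; the function names `attribute_counts` and `cosine_similarity`, and the choice of cosine similarity as the comparison measure, are illustrative assumptions. The sketch counts syntax-node types while ignoring identifier names, so a plagiarized copy produced by renaming variables yields the same feature vector as the original.

```python
import ast
import math
from collections import Counter


def attribute_counts(source: str) -> Counter:
    """Count AST node types in the source (hypothetical attribute-counting feature).

    Identifier names and literal values are ignored; only the kind of each
    syntax node contributes to the feature vector.
    """
    tree = ast.parse(source)
    return Counter(type(node).__name__ for node in ast.walk(tree))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Illustrative inputs: `renamed` is `original` with every identifier changed.
original = (
    "def total(xs):\n"
    "    s = 0\n"
    "    for x in xs:\n"
    "        s += x\n"
    "    return s\n"
)
renamed = (
    "def acc(items):\n"
    "    result = 0\n"
    "    for item in items:\n"
    "        result += item\n"
    "    return result\n"
)
different = (
    "def greet(name):\n"
    "    print('hello', name)\n"
)

sim_renamed = cosine_similarity(attribute_counts(original), attribute_counts(renamed))
sim_different = cosine_similarity(attribute_counts(original), attribute_counts(different))
```

Under these assumptions, `sim_renamed` is at (or within floating-point error of) the maximum value, while `sim_different` is lower. Attribute counts alone cannot distinguish programs that happen to share the same node mix, which is consistent with the motivation for combining them with statement-structure, program-structure, and program-function features.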