Measuring similarity between Karel programs using character and word n-grams

We present a method for measuring similarity between source code programs. We treat the task as a machine learning problem, using character and word n-grams as features and comparing several learning algorithms. We also examine the contribution of latent semantic analysis to this task. To evaluate the proposed approach, we developed a corpus of around 10,000 programs written in the Karel programming language that solve 100 different tasks. The results show that the highest classification accuracy is achieved with a Support Vector Machine classifier, with latent semantic analysis applied and word trigrams selected as features.
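As a rough illustration of the feature types named above, the following minimal sketch extracts word and character n-grams from program text and compares the resulting count vectors with cosine similarity. It is not the authors' pipeline: the latent semantic analysis step and the classifier are omitted, tokenization is assumed to be plain whitespace splitting, and the Karel-like snippets and all function names are hypothetical, invented for this example.

```python
from collections import Counter
from math import sqrt


def word_ngrams(source, n=3):
    """Count word n-grams; tokens are assumed to be whitespace-separated."""
    tokens = source.split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def char_ngrams(source, n=3):
    """Count character n-grams after collapsing runs of whitespace."""
    text = " ".join(source.split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty program shares no n-grams with anything
    return dot / (norm_a * norm_b)


# Hypothetical Karel-like snippets, invented for illustration only.
prog_a = "move ; move ; turnleft ; putbeeper ;"
prog_b = "move ; move ; turnleft ; pickbeeper ;"
prog_c = "turnleft ; turnleft ; turnleft ; move ;"

# Word trigrams: prog_a and prog_b differ only in their last instruction.
sim_ab = cosine(word_ngrams(prog_a), word_ngrams(prog_b))
sim_ac = cosine(word_ngrams(prog_a), word_ngrams(prog_c))

# Character trigrams over the same pair of programs.
char_sim_ab = cosine(char_ngrams(prog_a), char_ngrams(prog_b))

print(f"word trigrams: a~b={sim_ab:.3f}, a~c={sim_ac:.3f}")
print(f"char trigrams: a~b={char_sim_ab:.3f}")
```

In this toy setting the two programs that share most of their instruction sequence score higher on word trigrams than the pair that does not. In a classification setup such as the one described, count vectors like these would serve as input features rather than being compared directly.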
