Schemes for Labeling Semantic Code Clones using Machine Learning

Machine learning approaches to code clone detection perform poorly because training samples are scarce, and they have so far been limited to clones up to Type-III. Most publicly available code clone corpora are incomplete and lack labeled samples of semantic (Type-IV) clones. We present two schemes for labeling clones of all types, including Type-IV, and restrict our study to Java code. The first scheme uses an unsupervised approach to label Type-IV clones, and the resulting labels are validated by expert Java programmers. The second is a supervised scheme that labels (or classifies) unknown samples using the labeled samples produced by the first scheme. We evaluate both schemes on six well-known Java code clone corpora and report the quality of the produced clones in terms of kappa agreement, mean error, and accuracy scores. The results show that both schemes produce high-quality labeled clones, facilitating future use of machine learning to detect Type-IV clones.
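As an illustration of the kappa agreement mentioned above, the following minimal sketch computes Cohen's kappa between the labels produced by a labeling scheme and those assigned by an expert. The abstract does not say which kappa variant was used, so Cohen's two-rater form is an assumption here, and the function name, label strings, and sample data are hypothetical.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[l] * counts_b[l] for l in labels) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for six candidate clone pairs (illustrative data only).
scheme_labels = ["clone", "clone", "non-clone", "clone", "non-clone", "non-clone"]
expert_labels = ["clone", "clone", "non-clone", "non-clone", "non-clone", "non-clone"]
kappa = cohen_kappa(scheme_labels, expert_labels)
print(f"kappa = {kappa:.3f}")
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when one class (for example, non-clones) dominates the sample.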
