An empirical study on optimization of training dataset in harmfulness prediction of code clone using ensemble feature selection model

To address irrelevant features and imbalanced class distributions in clone code harmfulness prediction, an integrated classifier algorithm based on random under-sampling (RUS) and the Wrapper method was proposed. First, the majority-class samples in the training dataset were under-sampled into several sets whose sizes were proportional to the minority class, and each set was combined with the minority samples to create multiple different training subsets. Then, a Wrapper-based sequential floating forward search algorithm was used to select optimal feature subsets, and each training subset, with its particular class proportion, was mapped to its corresponding optimal feature subset. Finally, a random forest classifier was used to evaluate the resulting optimized training dataset. The experimental results showed that, applied to code clone harmfulness prediction, this integrated classifier algorithm improved accuracy, F1-measure, and AUC by about 7% on average. Compared with four other similar optimization methods, it increased the AUC value by 10.3%, which demonstrates the feasibility and effectiveness of the ensemble feature selection model.
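The two optimization steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`undersample`, `sffs`, `toy_score`), the sampling ratios, and the synthetic data are all assumptions made for illustration. In a real wrapper, the score would be the cross-validated performance of a classifier trained on one training subset restricted to the candidate features; a toy scoring function stands in for it here.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

Sample = Tuple[Sequence[float], int]  # (feature vector, label); 1 = minority class


def undersample(majority: List[Sample], minority: List[Sample],
                ratio: float, rng: random.Random) -> List[Sample]:
    """Keep all minority samples plus ratio * len(minority) random majority samples."""
    k = min(len(majority), round(ratio * len(minority)))
    return rng.sample(majority, k) + list(minority)


def sffs(n_features: int, score: Callable[[List[int]], float],
         max_size: int) -> List[int]:
    """Sequential floating forward search over feature indices (max_size >= 1)."""
    selected: List[int] = []
    best: Dict[int, Tuple[float, List[int]]] = {}  # subset size -> (score, subset)
    while len(selected) < min(max_size, n_features):
        # Forward step: add the single feature that gives the highest score.
        rest = [f for f in range(n_features) if f not in selected]
        selected = selected + [max(rest, key=lambda f: score(selected + [f]))]
        s = score(selected)
        if len(selected) not in best or s > best[len(selected)][0]:
            best[len(selected)] = (s, list(selected))
        # Floating step: drop features while that beats the best smaller subset seen.
        while len(selected) > 2:
            drop = max(selected, key=lambda f: score([g for g in selected if g != f]))
            reduced = [g for g in selected if g != drop]
            s_red = score(reduced)
            if s_red <= best[len(reduced)][0]:
                break
            selected, best[len(reduced)] = reduced, (s_red, list(reduced))
    # Return the best-scoring subset over all sizes visited.
    return max(best.values(), key=lambda t: t[0])[1]


# Hypothetical data: 100 majority and 10 minority samples with 5 features each.
rng = random.Random(0)
majority = [([rng.random() for _ in range(5)], 0) for _ in range(100)]
minority = [([rng.random() for _ in range(5)], 1) for _ in range(10)]

# One training subset per majority:minority ratio (the ratios are assumed values).
subsets = {r: undersample(majority, minority, r, rng) for r in (1.0, 2.0, 3.0)}


def toy_score(features: List[int]) -> float:
    """Stand-in for wrapper evaluation: rewards features 0 and 2, penalizes others."""
    return sum(1.0 if f in {0, 2} else -0.1 for f in features)


chosen = sorted(sffs(5, toy_score, max_size=4))
```

With the toy score, the search settles on features 0 and 2. To follow the mapping step of the abstract, `sffs` would be run once per entry of `subsets` with a score computed on that subset, so that each class proportion is paired with its own feature subset before the final random forest evaluation.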
