Compressed C4.5 Models for Software Defect Prediction

Defects in every software must be handled properly, and the number of defects directly reflects the quality of a software. In recent years, researchers have applied data mining and machine learning methods to predicting software defects. However, in their studies, the method in which the machine learning models are directly adopted may not be precise enough. Optimizing the machine learning models used in defects prediction will improve the prediction accuracy. In this paper, aiming at the characteristics of the metrics mined from the open source software, we proposed three new defect prediction models based on C4.5 model. The new models introduce the Spearman's rank correlation coefficient to the basis of choosing root node of the decision tree which makes the models better on defects prediction. In order to verify the effectiveness of the improved models, an experimental scheme is designed. In the experiment, we compared the prediction accuracies of the existing models and the improved models and the result showed that the improved models reduced the size of the decision tree by 49.91% on average and increased the prediction accuracy by 4.58% and 4.87% on two modules used in the experiment.