Is Bigger Data Better for Defect Prediction: Examining the Impact of Data Size on Supervised and Unsupervised Defect Prediction

Defect prediction could help software practitioners to predict the future occurrence of bugs in the software code regions. In order to improve the accuracy of defect prediction, dozens of supervised and unsupervised methods have been put forward and achieved good results in this field. One limiting factor of defect prediction is that the data size of defect data is not big, which restricts the scope of application with defect prediction models. In this study, we try to construct bigger defect datasets by merging available datasets with the same measurement dimension and check whether bigger data will bring better defect prediction performance with supervised and unsupervised models or not. The results of our experiment reveal that larger-scale dataset doesn’t bring improvements of both supervised and unsupervised classifiers.

[1]  Jun Wang,et al.  Compressed C4.5 Models for Software Defect Prediction , 2012, 2012 12th International Conference on Quality Software.

[2]  Yuming Zhou,et al.  Effort-aware just-in-time defect prediction: simple unsupervised models could be better than supervised models , 2016, SIGSOFT FSE.

[3]  Harald C. Gall,et al.  Cross-project defect prediction: a large scale experiment on data vs. domain vs. process , 2009, ESEC/SIGSOFT FSE.

[4]  Hareton K. N. Leung,et al.  An in-depth study of the potentially confounding effect of class size in fault prediction , 2014, TSEM.

[5]  Tim Menzies,et al.  Heterogeneous Defect Prediction , 2018, IEEE Trans. Software Eng..

[6]  Hoh Peter In,et al.  Micro interaction metrics for defect prediction , 2011, ESEC/FSE '11.

[7]  Lionel C. Briand,et al.  A systematic and comprehensive investigation of methods to build and evaluate fault prediction models , 2010, J. Syst. Softw..

[8]  Jingyi Zhang,et al.  Classifying Python Code Comments Based on Supervised Learning , 2018, WISA.

[9]  A. Scott,et al.  A Cluster Analysis Method for Grouping Means in the Analysis of Variance , 1974 .

[10]  Ahmed E. Hassan,et al.  Prioritizing the devices to test your app on: a case study of Android game apps , 2014, SIGSOFT FSE.

[11]  Song Wang,et al.  Automatically Learning Semantic Features for Defect Prediction , 2016, 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE).

[12]  Yuming Zhou,et al.  How Far We Have Progressed in the Journey? An Examination of Cross-Project Defect Prediction , 2018, ACM Trans. Softw. Eng. Methodol..

[13]  Jin Liu,et al.  Dictionary learning based software defect prediction , 2014, ICSE.

[14]  F. Wilcoxon,et al.  Individual comparisons of grouped data by ranking methods. , 1946, Journal of economic entomology.

[15]  L. Erlikh,et al.  Leveraging legacy system dollars for e-business , 2000 .

[16]  Tim Menzies,et al.  Active learning and effort estimation: Finding the essential content of software effort estimation data , 2013, IEEE Transactions on Software Engineering.

[17]  Khaled El Emam,et al.  The Confounding Effect of Class Size on the Validity of Object-Oriented Metrics , 2001, IEEE Trans. Software Eng..

[18]  Lefteris Angelis,et al.  Ranking and Clustering Software Cost Estimation Models through a Multiple Comparisons Algorithm , 2013, IEEE Transactions on Software Engineering.

[19]  Andreas Zeller,et al.  Predicting faults from cached history , 2008, ISEC '08.

[20]  Premkumar T. Devanbu,et al.  How, and why, process metrics are better , 2013, 2013 35th International Conference on Software Engineering (ICSE).

[21]  V. Suma,et al.  A novel method for software defect prediction: Hybrid of FCM and random forest , 2014, 2014 International Conference on Electronics and Communication Systems (ICECS).

[22]  Amjad Hudaib,et al.  Software Defect Prediction using Feature Selection and Random Forest Algorithm , 2017, 2017 International Conference on New Trends in Computing Sciences (ICTCS).

[23]  Shane McIntosh,et al.  Revisiting the Impact of Classification Techniques on the Performance of Defect Prediction Models , 2015, 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering.

[24]  Yuming Zhou,et al.  Empirical analysis of network measures for effort-aware fault-proneness prediction , 2016, Inf. Softw. Technol..