Software change‐proneness prediction through combination of bagging and resampling methods

Identifying the change‐prone parts of a software system helps managers and developers allocate maintenance resources and time effectively during the early phases of the software life cycle. File‐level change‐proneness prediction with binary classification methods makes such identification possible. Because change‐prone files usually account for only a small fraction of all files, standard classification methods perform unsatisfactorily. In this paper, we employ imbalanced learning methods, including bagging, resampling, and especially their combination, to mitigate the performance loss that class imbalance causes for standard classifiers in change‐proneness prediction. In addition, we propose a boxplot‐based partition method that assigns more appropriate change‐proneness labels to the training data. We validate the effectiveness of the combination methods in an empirical study of eight open‐source Java projects. The experimental results show that combining bagging with resampling significantly improves prediction performance over bagging or resampling alone. Of all the combination methods employed, bagging combined with undersampling performs better than the others. Moreover, support vector machine is a more effective base classifier than J48 and naive Bayes.
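The combination of bagging with undersampling can be illustrated with a minimal sketch. It is not the implementation used in the study: the base learner here is a toy nearest‐centroid classifier standing in for support vector machine, J48, or naive Bayes, and the function names, the two file‐level features, and the synthetic data are all assumptions made for illustration. Each ensemble member is trained on a bootstrap sample that is then balanced by randomly discarding majority‐class (stable) files, and the final label is decided by majority vote.

```python
import random
from collections import Counter


def undersample(X, y, rng):
    """Randomly drop majority-class rows until both classes have equal size."""
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    keep = minority + rng.sample(majority, len(minority))
    return [X[i] for i in keep], [y[i] for i in keep]


class CentroidClassifier:
    """Toy base learner (an assumption): predict the class with the nearest mean vector."""

    def fit(self, X, y):
        self.centroids = {}
        for label in set(y):
            rows = [x for x, l in zip(X, y) if l == label]
            self.centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict_one(self, x):
        def sq_dist(c):
            return sum((a - b) ** 2 for a, b in zip(x, c))
        return min(self.centroids, key=lambda label: sq_dist(self.centroids[label]))


def fit_undersampled_bagging(X, y, n_estimators=11, seed=0):
    """Train each member on a bootstrap sample that is balanced by undersampling."""
    if len(set(y)) < 2:
        raise ValueError("both classes must be present in the training data")
    rng = random.Random(seed)
    models = []
    while len(models) < n_estimators:
        idx = [rng.randrange(len(X)) for _ in range(len(X))]  # sample with replacement
        Xb, yb = [X[i] for i in idx], [y[i] for i in idx]
        if len(set(yb)) < 2:
            continue  # redraw if the bootstrap sample missed one class
        Xb, yb = undersample(Xb, yb, rng)
        models.append(CentroidClassifier().fit(Xb, yb))
    return models


def predict(models, x):
    """Majority vote over the ensemble members."""
    votes = Counter(m.predict_one(x) for m in models)
    return votes.most_common(1)[0][0]


# Synthetic, imbalanced data: 40 stable files (label 0) and 6 change-prone files (label 1).
# The two features are hypothetical file-level metrics.
X = [[10 + i % 5, 1 + i % 3] for i in range(40)] + [[50 + i, 8 + i % 2] for i in range(6)]
y = [0] * 40 + [1] * 6
models = fit_undersampled_bagging(X, y)
```

Calling `predict(models, [55, 9])` then classifies a file whose metrics resemble the change‐prone group. An odd number of ensemble members is used so that a binary vote cannot tie.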
