Nonlinear Geometric Framework for Software Defect Prediction

Humans use the software in every walk of life thus it is essential to have the best quality software. Software defect prediction models assist in identifying defect prone modules with the help of historical data, which in turn improves software quality. Historical data consists of data related to modules /files/classes which are labeled as buggy or clean. As the number of buggy artifacts as less as compared to clean artifacts, the nature of historical data becomes imbalance. Due to this uneven distribution of the data, it difficult for classification algorithms to build highly effective SDP models. The objective of this study is to propose a new nonlinear geometric framework based on SMOTE and ensemble learning to improve the performance of SDP models. The study combines the traditional SMOTE algorithm and the novel ensemble Support Vector Machine (SVM) is used to develop the proposed framework called SMEnsemble. SMOTE algorithm handles the class imbalance problem by generating synthetic instances of the minority class. Ensemble learning generates multiple classification models to select the best performing SDP model. For experimentation, datasets from three different software repositories that contain both open source as well as proprietary projects are used in the study. The results show that SMEnsemble performs better than traditional methods for identifying the minority class i.e. buggy artifacts. Also, the proposed model performance is better than the latest state of Art SDP model- SMOTUNED. The proposed model is capable of handling imbalance classes when compared with traditional methods. Also, by carefully selecting the number of ensembles high performance can be achieved in less time.

[1]  Thomas J. Ostrand,et al.  \{PROMISE\} Repository of empirical software engineering data , 2007 .

[2]  Nathalie Japkowicz,et al.  A Novelty Detection Approach to Classification , 1995, IJCAI.

[3]  Leo Breiman,et al.  Random Forests , 2001, Machine Learning.

[4]  Audris Mockus,et al.  Towards building a universal defect prediction model , 2014, MSR 2014.

[5]  Z. Ramadan,et al.  Metabolic profiling using principal component analysis, discriminant partial least squares, and genetic algorithms. , 2006, Talanta.

[6]  Tim Menzies,et al.  Data Mining Static Code Attributes to Learn Defect Predictors , 2007, IEEE Transactions on Software Engineering.

[7]  Taghi M. Khoshgoftaar,et al.  Experimental perspectives on learning from imbalanced data , 2007, ICML '07.

[8]  Burak Turhan,et al.  Complexity: Using Assemblies of Multiple Models , 2015, MoDELS 2015.

[9]  Tim Menzies,et al.  Tuning for Software Analytics: is it Really Necessary? , 2016, Inf. Softw. Technol..

[10]  ShenXipeng,et al.  Tuning for software analytics , 2016 .

[11]  Lior Rokach,et al.  Ensemble-based classifiers , 2010, Artificial Intelligence Review.

[12]  Zhong Liu,et al.  A Novel Ensemble Method for Imbalanced Data Learning: Bagging of Extrapolation-SMOTE SVM , 2017, Comput. Intell. Neurosci..

[13]  Jun Zheng,et al.  Cost-sensitive boosting neural networks for software defect prediction , 2010, Expert Syst. Appl..

[14]  M. J. Box A New Method of Constrained Optimization and a Comparison With Other Methods , 1965, Comput. J..

[15]  Taghi M. Khoshgoftaar,et al.  Attribute Selection and Imbalanced Data: Problems in Software Defect Prediction , 2010, 2010 22nd IEEE International Conference on Tools with Artificial Intelligence.

[16]  Haibo He,et al.  Learning from Imbalanced Data , 2009, IEEE Transactions on Knowledge and Data Engineering.

[17]  Zhi-Hua Zhou,et al.  Ieee Transactions on Knowledge and Data Engineering 1 Training Cost-sensitive Neural Networks with Methods Addressing the Class Imbalance Problem , 2022 .

[18]  Feng Zhang,et al.  Towards Generalizing Defect Prediction Models , 2016 .

[19]  Janez Demsar,et al.  Statistical Comparisons of Classifiers over Multiple Data Sets , 2006, J. Mach. Learn. Res..

[20]  Jeffrey C. Carver,et al.  Characterizing Software Architecture Changes: An Initial Study , 2007, ESEM 2007.

[21]  Mohammad Alshayeb,et al.  Software defect prediction using ensemble learning on selected features , 2015, Inf. Softw. Technol..

[22]  Xiaoyuan Jing,et al.  Multiple kernel ensemble learning for software defect prediction , 2015, Automated Software Engineering.

[23]  Tim Menzies,et al.  Is "Better Data" Better Than "Better Data Miners"? , 2017, 2018 IEEE/ACM 40th International Conference on Software Engineering (ICSE).

[24]  Nitesh V. Chawla,et al.  SMOTE: Synthetic Minority Over-sampling Technique , 2002, J. Artif. Intell. Res..

[25]  Eleni Anthippi Chatzimichali,et al.  Novel application of heuristic optimisation enables the creation and thorough evaluation of robust support vector machine ensembles for machine learning applications , 2015, Metabolomics.

[26]  Tim Menzies,et al.  "Better Data" is Better than "Better Data Miners" (Benefits of Tuning SMOTE for Defect Prediction) , 2017, ICSE.