HSDD: A hybrid sampling strategy for class imbalance in defect prediction data sets

Class imbalance is a common problem in defect prediction data sets. Over-sampling and under-sampling methods are typically employed to cope with it. However, these methods alter data at the instance level and are not specialized for the feature space. Moreover, no approach has been designed specifically for class imbalance in defect prediction data sets. We develop HSDD (hybrid sampling for defect data sets) to address this gap. HSDD combines the derivation of low-level metrics with the reduction of repeated data points. The method was evaluated on industrial and open-source project data sets using Bayes, naive Bayes, random forest, and J48 classifiers, in terms of g-mean and training time. The results show that HSDD yields promising training performance, especially on large-scale data sets.
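For readers unfamiliar with the evaluation measure, g-mean is commonly defined as the geometric mean of sensitivity (recall on the defective class) and specificity (recall on the non-defective class). The sketch below illustrates that standard definition only; it is not taken from the paper, and the function name `g_mean` and the label convention (1 = defective) are assumptions made for illustration.

```python
import math

def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity and specificity (standard definition).

    Assumption: binary labels, with `positive` marking the defective class.
    """
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    # Guard against empty classes to avoid division by zero.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)

# Hypothetical toy labels: 2 defective modules, 4 non-defective modules.
y_true = [1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1]
score = g_mean(y_true, y_pred)  # sqrt(0.5 * 0.75)
```

Because the two recalls are multiplied, a classifier that labels every module as non-defective scores 0 regardless of its accuracy, which is why the measure suits imbalanced data.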
