Combating class imbalance problem in semi-supervised defect detection

Detection of defect-prone software modules is an important topic in software quality research and has been widely studied for settings where sufficient defect data are available. We propose an improved semi-supervised learning approach for defect detection that addresses both class imbalance and limited labeled data. The approach applies random under-sampling to resample the original training set and updates the training set in each round of a co-training-style algorithm. Compared with conventional machine learning approaches, our method achieves significantly better performance on the AUC (area under the receiver operating characteristic curve) metric. The experimental results also suggest that the proposed approach can serve as a basis for designing better methods to tackle class imbalance in semi-supervised learning.
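The resampling step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `random_undersample`, the 1:1 target class ratio, the toy data, and the three-round loop are all assumptions made for illustration, and the classifier training and labeling of unlabeled modules that a co-training-style algorithm performs are left as a comment.

```python
import random
from collections import Counter


def random_undersample(samples, labels, seed=None):
    """Randomly drop samples so every class matches the minority-class size.

    Hypothetical helper; the 1:1 target ratio is an assumption, not a
    detail taken from the abstract.
    """
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    n_min = min(len(xs) for xs in by_class.values())
    balanced = []
    for y, xs in by_class.items():
        # Sample without replacement from each class.
        for x in rng.sample(xs, n_min):
            balanced.append((x, y))
    rng.shuffle(balanced)
    return [x for x, _ in balanced], [y for _, y in balanced]


# Toy data (made up): 2 defective modules (label 1) among 10.
modules = [[i, i * 2] for i in range(10)]
labels = [1, 1] + [0] * 8

for round_idx in range(3):
    # Re-draw a balanced training set at the start of every round.
    X_bal, y_bal = random_undersample(modules, labels, seed=round_idx)
    # A real co-training-style loop would train classifiers on
    # (X_bal, y_bal) here and then add confidently labeled modules
    # to the training set before the next round.
    print(round_idx, Counter(y_bal))
```

Re-drawing the balanced subset in each round means that different majority-class modules are seen across rounds, which is one common motivation for repeated under-sampling rather than a single fixed subsample.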
