On Software Defect Prediction Using Machine Learning

This paper investigates how kernel methods can be used for software defect prediction, since class imbalance can severely degrade the performance of defect predictors. Two classifiers, the asymmetric kernel partial least squares classifier (AKPLSC) and the asymmetric kernel principal component analysis classifier (AKPCAC), are proposed to address the class imbalance problem. They are obtained by applying a kernel function to the asymmetric partial least squares classifier and the asymmetric principal component analysis classifier, respectively; both classifiers use the Gaussian kernel. Experiments on the NASA and SOFTLAB data sets, evaluated with the F-measure, Friedman's test, and Tukey's test, confirm the validity of our methods.
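To make the pipeline concrete, here is a minimal sketch of the general approach the abstract describes: Gaussian-kernel feature extraction followed by an imbalance-aware classifier, evaluated with the F-measure. The paper's AKPLSC and AKPCAC are asymmetric variants that are not available in standard libraries, so scikit-learn's KernelPCA plus a class-weighted linear classifier stands in for the idea; the synthetic data set, the 10% defect ratio, and the gamma value are illustrative assumptions, not the authors' implementation.

```python
# Sketch: Gaussian (RBF) kernel feature extraction + imbalance-aware
# classification, scored with the F-measure. This approximates the
# paper's kernelized asymmetric classifiers; it is NOT the AKPLSC/AKPCAC.
from sklearn.datasets import make_classification
from sklearn.decomposition import KernelPCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic imbalanced binary data: ~10% "defective" modules
# (a hypothetical ratio, standing in for the NASA/SOFTLAB data).
X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# Gaussian-kernel feature extraction, then a classifier whose class
# weighting crudely mimics the asymmetric treatment of the minority
# (defect-prone) class.
model = make_pipeline(
    KernelPCA(n_components=10, kernel="rbf", gamma=0.05),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
model.fit(X_train, y_train)

# F-measure (F1) on the minority class, as in the paper's evaluation.
print("F-measure:", f1_score(y_test, model.predict(X_test)))

# Comparing several classifiers across several data sets, the per-data-set
# F-measures would then be ranked and compared, e.g. with Friedman's test:
#   from scipy.stats import friedmanchisquare
#   stat, p = friedmanchisquare(scores_clf1, scores_clf2, scores_clf3)
```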
