Defect Prediction between Software Versions with Active Learning and Dimensionality Reduction

Accurate detection of defects prior to product release helps software engineers focus verification activities on defect-prone modules, thus improving the effectiveness of software development. A common scenario is to use the defects from prior releases to build the prediction model for the upcoming release, typically through a supervised learning method. Because software development is a dynamic process, fault characteristics in subsequent releases may vary. Therefore, supplementing the defect information from prior releases with limited information about defects detected early in the current release offers intuitive and practical benefits. We propose active learning as a way to automate the development of models that improve the performance of defect prediction between successive releases. Our results show that the integration of active learning with uncertainty sampling consistently outperforms the corresponding supervised learning approach. We further improve prediction performance with feature compression techniques, in which feature selection or dimensionality reduction is applied to the defect data prior to active learning. We observe that dimensionality reduction techniques, particularly multidimensional scaling with random forest similarity, work better than feature selection because of their ability to identify and combine essential information across data set features. We demonstrate the improvements offered by this methodology by predicting defective modules in three successive versions of Eclipse.
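To make the methodology concrete, the sketch below illustrates the two main ingredients named in the abstract: (1) dimensionality reduction via multidimensional scaling over a random forest similarity (the fraction of trees in which two modules fall into the same leaf), and (2) an active learning loop with uncertainty sampling that starts from prior-release data and queries labels for the current-release modules the model is least certain about. This is a minimal sketch, assuming scikit-learn; the function names, parameters, and query budget are illustrative and not the authors' exact procedure.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.manifold import MDS


def rf_proximity(forest, X):
    """Random forest similarity: fraction of trees in which two modules
    land in the same leaf node."""
    leaves = forest.apply(X)                                  # (n_samples, n_trees)
    return (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)


def mds_reduce(proximity, n_components=5, seed=0):
    """Metric MDS embedding of the random-forest dissimilarity (1 - proximity)."""
    mds = MDS(n_components=n_components, dissimilarity="precomputed",
              random_state=seed)
    return mds.fit_transform(1.0 - proximity)


def uncertainty_sampling_loop(X_prev, y_prev, X_curr, y_curr_oracle,
                              budget=50, batch=5, seed=0):
    """Cross-release active learning: train on the prior release plus the
    current-release modules labeled so far, then query the modules whose
    predicted defect probability is closest to 0.5."""
    labeled = []                                              # queried current-release indices
    pool = list(range(len(X_curr)))
    clf = RandomForestClassifier(n_estimators=100, random_state=seed)

    while len(labeled) < budget and pool:
        if labeled:
            X_train = np.vstack([X_prev, X_curr[labeled]])
            y_train = np.concatenate([y_prev, y_curr_oracle[labeled]])
        else:
            X_train, y_train = X_prev, y_prev
        clf.fit(X_train, y_train)

        # Uncertainty sampling: least confident predictions first.
        proba = clf.predict_proba(X_curr[pool])[:, 1]
        order = np.argsort(np.abs(proba - 0.5))[:batch]
        queried = [pool[i] for i in order]

        labeled.extend(queried)                               # simulate asking an oracle/tester
        pool = [i for i in pool if i not in queried]

    return clf
```

In practice the MDS embedding would be computed on the static code metrics of both releases before the active learning loop runs, so the classifier operates in the reduced feature space; the sketch keeps the two steps separate for clarity.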
