An Automated Defect Prediction Framework using Genetic Algorithms: A Validation of Empirical Studies

Today, it is common for software projects to collect measurement data throughout the development process. With these data, defect prediction software can estimate the defect proneness of a software module, with the objective of assisting and guiding software practitioners. With timely and accurate defect predictions, practitioners can focus their limited testing resources on higher-risk areas. This paper reports the results of three empirical studies that use an automated genetic defect prediction framework. This framework generates and compares different learning schemes (preprocessing + attribute selection + learning algorithm) and selects the best one using a genetic algorithm, with the objective of estimating the defect proneness of a software module. The first empirical study is a performance comparison of our framework with the most important framework in the literature. The second empirical study is a performance and runtime comparison between our framework and an exhaustive framework. The third empirical study is a sensitivity analysis, and it is our main contribution in this paper. The performance of the defect prediction models, measured by AUC (Area Under the Curve), was validated using NASA-MDP and PROMISE data sets. Seventeen data sets from NASA-MDP (13) and PROMISE (4) projects were analyzed using N×M-fold cross-validation. A genetic algorithm was used to select the components of the learning schemes automatically, and to assess and report the results. Our results show similar performance between the frameworks. Our framework had a better runtime than the exhaustive framework. Finally, we report the best configuration according to the sensitivity analysis.
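The idea of searching over learning schemes with a genetic algorithm can be illustrated with a minimal sketch. It is not the authors' implementation: the component names in `GENES`, the operators (tournament selection, one-point crossover, elitism), the parameter defaults, and the `toy_fitness` function are all assumptions made for illustration. In the framework described above, the fitness of a scheme would be its AUC under N×M-fold cross-validation; here a fixed toy score stands in for that measurement.

```python
import random

# Hypothetical component lists; the abstract does not enumerate the real ones.
PREPROCESSORS = ["none", "log", "normalize"]
SELECTORS = ["none", "forward", "backward"]
LEARNERS = ["naive_bayes", "logistic", "decision_tree"]
GENES = [PREPROCESSORS, SELECTORS, LEARNERS]


def random_scheme(rng):
    """Draw one learning scheme: (preprocessing, attribute selection, learner)."""
    return tuple(rng.choice(options) for options in GENES)


def crossover(a, b, rng):
    """One-point crossover between two parent schemes."""
    cut = rng.randint(1, len(a) - 1)
    return a[:cut] + b[cut:]


def mutate(scheme, rng, rate=0.1):
    """Resample each gene independently with probability `rate`."""
    return tuple(
        rng.choice(GENES[i]) if rng.random() < rate else gene
        for i, gene in enumerate(scheme)
    )


def select_scheme(fitness, pop_size=8, generations=10, seed=0):
    """Return (best scheme, its fitness, number of distinct schemes evaluated)."""
    rng = random.Random(seed)
    cache = {}  # avoid re-evaluating a scheme that was already scored

    def score(scheme):
        if scheme not in cache:
            cache[scheme] = fitness(scheme)
        return cache[scheme]

    population = [random_scheme(rng) for _ in range(pop_size)]
    for _ in range(generations):
        children = [max(population, key=score)]  # elitism: keep the current best
        while len(children) < pop_size:
            # Binary tournament selection for each parent.
            a = max(rng.sample(population, 2), key=score)
            b = max(rng.sample(population, 2), key=score)
            children.append(mutate(crossover(a, b, rng), rng))
        population = children
    best = max(population, key=score)
    return best, score(best), len(cache)


def toy_fitness(scheme):
    """Stand-in for cross-validated AUC; the numbers are invented."""
    bonus = {"log": 0.05, "forward": 0.03, "naive_bayes": 0.04}
    return 0.70 + sum(bonus.get(gene, 0.0) for gene in scheme)


if __name__ == "__main__":
    best, auc_estimate, evaluated = select_scheme(toy_fitness)
    print(best, round(auc_estimate, 3), evaluated)
```

Because scored schemes are cached, the search evaluates at most the 27 schemes an exhaustive comparison of this toy space would train, and often fewer. This is the kind of runtime saving the second empirical study measures against an exhaustive framework.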
