An Empirical Study on the Effectiveness of Feature Selection for Cross-Project Defect Prediction

Software defect prediction has attracted considerable attention from software engineering researchers. Feature selection approaches have been introduced into software defect prediction and can effectively improve the performance of traditional defect prediction, known as within-project defect prediction (WPDP). However, feature selection has not been studied sufficiently for cross-project defect prediction (CPDP). In this paper, we apply feature subset selection and feature ranking approaches to explore the effectiveness of feature selection for CPDP. We conduct an empirical study on the NASA and PROMISE datasets. The results show that both feature subset selection and feature ranking can improve the performance of CPDP. Future CPDP studies should therefore either select a representative feature subset or set a reasonable proportion of selected features.
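
To make the idea of feature ranking with a fixed proportion of selected features concrete, the following is a minimal sketch rather than the procedure used in the study. It assumes a ranking score of absolute Pearson correlation between each metric and the defect label, which is an illustrative choice only; the toy data, the metric names (loc, complexity, noise), and the helper functions are all hypothetical. In the cross-project setting, the ranking is computed on the source project and the same feature indices are then applied to the target project.

```python
import math

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def rank_features(rows, labels):
    """Return feature indices sorted by |correlation with the label|, best first."""
    n_features = len(rows[0])
    scores = [abs(pearson([r[j] for r in rows], labels)) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)

def select_top(ranked, proportion):
    """Keep the top `proportion` of ranked features (at least one), in column order."""
    k = max(1, math.ceil(proportion * len(ranked)))
    return sorted(ranked[:k])

def project(rows, kept):
    """Restrict every instance to the kept feature indices."""
    return [[r[j] for j in kept] for r in rows]

# Hypothetical source project: columns are [loc, complexity, noise].
source_rows = [[10, 1, 5], [200, 9, 4], [15, 2, 6],
               [300, 12, 5], [20, 1, 4], [250, 10, 6]]
source_labels = [0, 1, 0, 1, 0, 1]  # 1 = defective module

# Hypothetical target project with the same metric set but no labels.
target_rows = [[12, 2, 5], [280, 11, 5]]

# Rank on the source project only, then reuse the indices on the target.
ranked = rank_features(source_rows, source_labels)
kept = select_top(ranked, proportion=0.5)
source_selected = project(source_rows, kept)
target_selected = project(target_rows, kept)
```

Any classifier can then be trained on `source_selected` and applied to `target_selected`. The `proportion` argument corresponds to the proportion of selected features mentioned above; what counts as a reasonable value is an empirical question that this sketch does not answer.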
