Feature Subset Selection and Instance Filtering for Cross-project Defect Prediction - Classification and Ranking

The defect prediction models can be a good tool on organizing the project’s test resources. The models can be constructed with two main goals: 1) to classify the software parts defective or not; or 2) to rank the most defective parts in a decreasing order. However, not all companies maintain an appropriate set of historical defect data. In this case, a company can build an appropriate dataset from known external projects called Crossproject Defect Prediction (CPDP). The CPDP models, however, present low prediction performances due to the heterogeneity of data. Recently, Instance Filtering methods were proposed in order to reduce this heterogeneity by selecting the most similar instances from the training dataset. Originally, the similarity is calculated based on all the available dataset features (or independent variables). We propose that using only the most relevant features on the similarity calculation can result in more accurate filtered datasets and better prediction performances. In this study we extend our previous work. We analyse both prediction goals Classification and Ranking. We present an empirical evaluation of 41 different methods by associating Instance Filtering methods with Feature Selection methods. We used 36 versions of 11 open source projects on experiments. The results show similar evidences for both prediction goals. First, the defect prediction performance of CPDP models can be improved by associating Feature Selection and Instance Filtering. Second, no evaluated method presented general better performances. Indeed, the most appropriate method can vary according to the characteristics of the project being predicted.

[1]  Mohak Shah,et al.  Evaluating Learning Algorithms: A Classification Perspective , 2011 .

[2]  Xin Yao,et al.  A Learning-to-Rank Approach to Software Defect Prediction , 2015, IEEE Transactions on Reliability.

[3]  Ying Zou,et al.  Cross-Project Defect Prediction Using a Connectivity-Based Unsupervised Classifier , 2016, 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE).

[4]  Lech Madeyski,et al.  Towards identifying software project clusters with regard to defect prediction , 2010, PROMISE '10.

[5]  A. Zeller,et al.  Predicting Defects for Eclipse , 2007, Third International Workshop on Predictor Models in Software Engineering (PROMISE'07: ICSE Workshops 2007).

[6]  Elaine J. Weyuker,et al.  Comparing the effectiveness of several modeling methods for fault prediction , 2010, Empirical Software Engineering.

[7]  Anjaneyulu Pasala,et al.  Evaluating Performance of Network Metrics for Bug Prediction in Software , 2013, 2013 20th Asia-Pacific Software Engineering Conference (APSEC).

[8]  Gábor Csárdi,et al.  The igraph software package for complex network research , 2006 .

[9]  Robert Tibshirani,et al.  A Framework for Feature Selection in Clustering , 2010, Journal of the American Statistical Association.

[10]  Luís Torgo,et al.  Data Mining with R: Learning with Case Studies , 2010 .

[11]  Ayse Basar Bener,et al.  Guest editorial: learning to organize testing , 2011, Automated Software Engineering.

[12]  Elaine J. Weyuker,et al.  Predicting the location and number of faults in large software systems , 2005, IEEE Transactions on Software Engineering.

[13]  Huan Liu,et al.  Advancing Feature Selection Research − ASU Feature Selection Repository , 2010 .

[14]  Elliot Soloway,et al.  Where the bugs are , 1985, CHI '85.

[15]  Yue Jiang,et al.  Techniques for evaluating fault prediction models , 2008, Empirical Software Engineering.

[16]  Adenilso da Silva Simão,et al.  Feature Subset Selection for Instance Filtering Methods on Cross-project Defect Prediction , 2016, CIbSE.

[17]  Giovanni Denaro,et al.  An empirical evaluation of fault-proneness models , 2002, ICSE '02.

[18]  Guilherme Horta Travassos,et al.  Cross versus Within-Company Cost Estimation Studies: A Systematic Review , 2007, IEEE Transactions on Software Engineering.

[19]  Tim Menzies,et al.  Learning from Open-Source Projects: An Empirical Study on Defect Prediction , 2013, 2013 ACM / IEEE International Symposium on Empirical Software Engineering and Measurement.

[20]  Foutse Khomh,et al.  Predicting Bugs Using Antipatterns , 2013, 2013 IEEE International Conference on Software Maintenance.

[21]  Leo Breiman,et al.  Random Forests , 2001, Machine Learning.

[22]  Thomas M. Cover,et al.  Elements of Information Theory , 2005 .

[23]  Victor R. Basili,et al.  The influence of organizational structure on software quality , 2008, 2008 ACM/IEEE 30th International Conference on Software Engineering.

[24]  Andy Liaw,et al.  Classification and Regression by randomForest , 2007 .

[25]  Haruhiko Kaiya,et al.  Adapting a fault prediction model to allow inter languagereuse , 2008, PROMISE '08.

[26]  Bart Baesens,et al.  Benchmarking Classification Models for Software Defect Prediction: A Proposed Framework and Novel Findings , 2008, IEEE Transactions on Software Engineering.

[27]  Andreas Zeller,et al.  Predicting defects using change genealogies , 2013, 2013 IEEE 24th International Symposium on Software Reliability Engineering (ISSRE).

[28]  Harald C. Gall,et al.  Cross-project defect prediction: a large scale experiment on data vs. domain vs. process , 2009, ESEC/SIGSOFT FSE.

[29]  Brendan Murphy,et al.  Can developer-module networks predict failures? , 2008, SIGSOFT '08/FSE-16.

[30]  Guangchun Luo,et al.  Transfer learning for cross-company software defect prediction , 2012, Inf. Softw. Technol..

[31]  Zhi-Hua Zhou,et al.  Sample-based software defect prediction with active and semi-supervised learning , 2012, Automated Software Engineering.

[32]  Sinno Jialin Pan,et al.  Transfer defect learning , 2013, 2013 35th International Conference on Software Engineering (ICSE).

[33]  Victor R. Basili,et al.  A Validation of Object-Oriented Design Metrics as Quality Indicators , 1996, IEEE Trans. Software Eng..

[34]  Tim Menzies,et al.  Better cross company defect prediction , 2013, 2013 10th Working Conference on Mining Software Repositories (MSR).

[35]  Ayse Basar Bener,et al.  On the relative value of cross-company and within-company data for defect prediction , 2009, Empirical Software Engineering.

[36]  Igor Kononenko,et al.  Estimating Attributes: Analysis and Extensions of RELIEF , 1994, ECML.

[37]  Anil K. Jain Data clustering: 50 years beyond K-means , 2008, Pattern Recognit. Lett..

[38]  ZhangHongyu,et al.  Comments on "Data Mining Static Code Attributes to Learn Defect Predictors" , 2007 .

[39]  Nachiappan Nagappan,et al.  Predicting defects using network analysis on dependency graphs , 2008, 2008 ACM/IEEE 30th International Conference on Software Engineering.

[40]  Audris Mockus,et al.  Towards building a universal defect prediction model , 2014, MSR 2014.

[41]  Ahmed E. Hassan,et al.  Predicting faults using the complexity of code changes , 2009, 2009 IEEE 31st International Conference on Software Engineering.

[42]  Taghi M. Khoshgoftaar,et al.  A Comprehensive Empirical Study of Count Models for Software Fault Prediction , 2007, IEEE Transactions on Reliability.