Better cross company defect prediction

How can we find data for quality prediction? Early in their life cycle, projects may lack the data needed to build such predictors. Prior work assumed that the most relevant training data was found nearest to the local project. But is this the best approach? This paper introduces the Peters filter, which is based on the following conjecture: when local data is scarce, more information exists in other projects. Accordingly, this filter selects training data via the structure of the other projects. To assess the performance of the Peters filter, we compare it with two other approaches for quality prediction: within-company learning, and cross-company learning with the Burak filter (the state-of-the-art relevancy filter). This paper finds that: 1) within-company predictors are weak for small datasets; 2) the Peters filter+cross-company builds better predictors than both within-company and the Burak filter+cross-company; and 3) the Peters filter builds 64% more useful predictors than both within-company and the Burak filter+cross-company approaches. Hence, we recommend the Peters filter for cross-company learning.
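The contrast between the two relevancy filters can be made concrete with a short sketch. The snippet below is an illustration based only on the description above, not the authors' implementation: it assumes numeric feature matrices, Euclidean distance, and k=10 neighbours for the Burak-style filter, and the function names (burak_filter, peters_filter, _pairwise_dist) are hypothetical. The Burak-style filter lets the scarce local (test) data pick its nearest cross-company instances, while the Peters-style filter lets each cross-company instance pick its nearest local instance, so that the structure of the other projects drives the selection.

```python
# Minimal sketch (not the paper's reference implementation) of two
# relevancy filters for cross-company defect prediction.
import numpy as np

def _pairwise_dist(a, b):
    """Euclidean distances between every row of `a` and every row of `b`."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)

def burak_filter(cross_X, local_X, k=10):
    """Burak-style filtering: for each local (test) instance, keep its k
    nearest cross-company instances; train on the union of those picks."""
    d = _pairwise_dist(local_X, cross_X)           # local rows x cross columns
    picked = np.argsort(d, axis=1)[:, :k].ravel()  # k nearest cross rows per local row
    return np.unique(picked)                       # indices into cross_X

def peters_filter(cross_X, local_X):
    """Peters-style filtering (as read from the abstract): each cross-company
    instance points at its nearest local instance; for every local instance
    that was pointed at, keep only the closest such cross instance, so the
    structure of the other projects shapes the training set."""
    d = _pairwise_dist(cross_X, local_X)           # cross rows x local columns
    nearest_local = np.argmin(d, axis=1)           # local row picked by each cross row
    picked = []
    for j in np.unique(nearest_local):
        candidates = np.where(nearest_local == j)[0]
        picked.append(candidates[np.argmin(d[candidates, j])])
    return np.array(sorted(picked))                # indices into cross_X

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cross = rng.normal(size=(200, 5))              # plentiful other-project data
    local = rng.normal(size=(15, 5))               # scarce local data
    print("Burak-style picks :", len(burak_filter(cross, local)))
    print("Peters-style picks:", len(peters_filter(cross, local)))
```

Under this reading, the Burak-style filter can return up to k rows per local instance, all chosen by the local data, whereas the Peters-style filter returns at most one cross-company representative per local instance, chosen by the cross-company data itself.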
