On the relative value of cross-company and within-company data for defect prediction

We propose a practical defect prediction approach for companies that do not track defect-related data. Specifically, we investigate the applicability of cross-company (CC) data for building localized defect predictors using static code features. First, we analyze the conditions under which CC data can be used as-is; these conditions turn out to be quite rare. We then apply principles of analogy-based learning (i.e., nearest-neighbor (NN) filtering) to CC data in order to fine-tune these models for localization. We compare the performance of these models with that of defect predictors learned from within-company (WC) data. As expected, defect predictors learned from WC data outperform those learned from CC data. However, our analyses also yield defect predictors learned from NN-filtered CC data whose performance is close to, but still not better than, that of WC predictors. Therefore, we perform a final analysis to determine the minimum number of local defect reports needed to learn WC defect predictors. We demonstrate that the minimum number of data samples required to build effective defect predictors can be quite small and can be collected within a few months. Hence, for companies with no local defect data, we recommend a two-phase approach that lets them adopt defect prediction immediately. In phase one, companies should use NN-filtered CC data to initiate the defect prediction process and simultaneously start collecting WC (local) data. Once enough WC data has been collected (i.e., after a few months), organizations should switch to phase two and use predictors learned from WC data.
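The NN-filtering step can be read as follows: for each local module, select its nearest cross-company neighbors by distance over static code features, and train the defect predictor only on the union of those selected CC instances. The sketch below illustrates one plausible reading of that step in Python; the Euclidean distance, k = 10 neighbors per local instance, and the Naive Bayes classifier are assumptions made for illustration rather than details stated in the abstract, and the variable names (cc_X, cc_y, wc_X) are hypothetical.

# Minimal sketch of NN-filtering cross-company (CC) data for a local (WC) project.
# Assumes static code features are numeric NumPy arrays; k and the classifier
# are illustrative choices, not prescribed by the abstract.
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.naive_bayes import GaussianNB

def nn_filter_cc_data(cc_features, cc_labels, local_features, k=10):
    """For each local instance, select its k nearest CC instances
    (Euclidean distance over static code features) and return the union."""
    nn = NearestNeighbors(n_neighbors=k).fit(cc_features)
    _, idx = nn.kneighbors(local_features)    # neighbor indices, shape (n_local, k)
    selected = np.unique(idx.ravel())         # union of all selected CC rows
    return cc_features[selected], cc_labels[selected]

# Usage sketch: train a defect predictor on the filtered CC data only.
# cc_X, cc_y: cross-company features and defect labels; wc_X: local code modules.
# filtered_X, filtered_y = nn_filter_cc_data(cc_X, cc_y, wc_X, k=10)
# model = GaussianNB().fit(filtered_X, filtered_y)
# predictions = model.predict(wc_X)

In phase one of the recommended approach, a model trained this way would serve as the interim predictor while local defect data accumulates; once enough WC data exists, the same classifier can simply be refit on the local data.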
