An empirical evaluation of outlier deletion methods for analogy-based cost estimation

Background: Software project datasets sometimes include outliers, which degrade the accuracy of effort estimation. Outlier deletion methods are often used to eliminate them. However, few case studies have applied outlier deletion methods to analogy-based estimation, so it is unclear which method suits it best. Aim: To clarify the effects of existing outlier deletion methods (Cook's distance based deletion, LTS based deletion, k-means based deletion, Mantel's correlation based deletion, and EID based deletion) and of our method on analogy-based estimation. Method: Our method removes a project from the neighborhood of a target project when its effort differs extremely from that of the other neighbors. In the experiment, the outlier deletion methods were applied to three datasets (the ISBSG, Kitchenham, and Desharnais datasets), and their estimation accuracy was evaluated using BRE (Balanced Relative Error). Results: The deletion methods designed for analogy-based estimation (i.e., Mantel's correlation based deletion, EID based deletion, and our method) showed stable performance. In particular, only our method improved the average BRE by more than 10% on two datasets. Conclusions: It is reasonable to apply deletion methods designed for analogy-based estimation, and our method is the preferable choice among them.
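As a rough illustration of the evaluation criterion and of the idea of filtering a target project's neighbors, the following Python sketch may help. The `bre` function uses the common definition of Balanced Relative Error, |actual - estimate| / min(actual, estimate). The neighbor filter is only an assumption: the abstract does not say how "extremely different" is decided, so the median-based rule, the `ratio` parameter, and the function names `drop_extreme_neighbors` and `estimate_effort` are hypothetical and are not the authors' procedure.

```python
from statistics import mean, median


def bre(actual, estimate):
    """Balanced Relative Error: |actual - estimate| / min(actual, estimate)."""
    if actual <= 0 or estimate <= 0:
        raise ValueError("effort values must be positive")
    return abs(actual - estimate) / min(actual, estimate)


def drop_extreme_neighbors(efforts, ratio=3.0):
    """Hypothetical filter (an assumption, not the paper's rule).

    Keeps only neighbors whose effort lies within a factor of `ratio`
    of the median effort of all neighbors. If nothing would remain,
    the original neighbors are returned unchanged.
    """
    m = median(efforts)
    kept = [e for e in efforts if m / ratio <= e <= m * ratio]
    return kept or list(efforts)


def estimate_effort(neighbor_efforts, ratio=3.0):
    """Estimate the target's effort as the mean effort of the kept neighbors."""
    return mean(drop_extreme_neighbors(neighbor_efforts, ratio))


if __name__ == "__main__":
    neighbors = [100, 120, 110, 900]  # made-up efforts of similar projects
    print(drop_extreme_neighbors(neighbors))
    print(estimate_effort(neighbors))
    print(bre(actual=100, estimate=50), bre(actual=100, estimate=200))
```

BRE penalizes underestimation and overestimation symmetrically: with an actual effort of 100, estimates of 50 and 200 both yield the same error, which would not be the case for a relative error normalized by the actual effort alone.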
