Investigating the use of duration-based moving windows to improve software effort prediction: A replicated study

Abstract

Context: Most research in software effort estimation has not considered chronology when selecting projects for training and testing sets. A chronological split uses each project's start and completion dates, so that any model that estimates effort for a new project p uses as training data only projects that were completed prior to p's start. Four recent studies investigated chronological splits with moving windows, in which only the most recent projects completed prior to a project's start date were used as training data. The first three studies (S1–S3) found some evidence in favor of using windows; they all defined window size as a fixed number of recent projects. In practice, we suggest that estimators think in terms of elapsed time rather than the size of the data set when deciding which projects to include in a training set. In the fourth study (S4) we showed that windows based on duration can also improve estimation accuracy.

Objective: This paper's contribution is to extend S4 using an additional data set, and to investigate the effect on accuracy of using moving windows of various durations.

Method: Stepwise multivariate regression was used to build prediction models, using all available training data and also using windows of various durations to select training data. Accuracy was compared based on absolute residuals and magnitudes of relative error (MREs); the Wilcoxon test was used to check the statistical significance of differences between results. Accuracy was also compared against estimates derived from windows containing fixed numbers of projects.

Results: Neither fixed-size nor fixed-duration windows provided superior estimation accuracy in the new data set.

Conclusions: Contrary to intuition, our results suggest that it is not always beneficial to exclude old data when estimating effort for new projects. When windows are helpful, windows based on duration are effective.
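As an illustration of the chronological split and duration-based window described in the abstract, the following is a minimal Python sketch, not the authors' implementation. The `Project` record, the function names, and the choice to measure the window back from the target project's start date using completion dates are assumptions made for the example. The two accuracy measures use the standard definitions: the absolute residual |actual − estimate| and MRE = |actual − estimate| / actual.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class Project:
    # Hypothetical record layout, chosen for illustration only.
    name: str
    start: date
    end: date
    effort: float  # actual effort, e.g. in person-hours


def training_set(history: List[Project], target_start: date,
                 window_days: Optional[int] = None) -> List[Project]:
    """Select training projects for a target that starts on target_start.

    Chronological split: keep only projects completed before the target starts.
    Duration window (assumed definition): additionally keep only projects
    completed within window_days before the target's start date.
    """
    completed = [p for p in history if p.end < target_start]
    if window_days is None:
        return completed  # growing portfolio: all available training data
    cutoff = target_start - timedelta(days=window_days)
    return [p for p in completed if p.end >= cutoff]


def absolute_residual(actual: float, estimate: float) -> float:
    """Absolute difference between actual and estimated effort."""
    return abs(actual - estimate)


def mre(actual: float, estimate: float) -> float:
    """Magnitude of relative error; actual effort must be positive."""
    return abs(actual - estimate) / actual


if __name__ == "__main__":
    history = [
        Project("A", date(2010, 1, 1), date(2010, 6, 30), 1000.0),
        Project("B", date(2011, 1, 1), date(2011, 9, 30), 800.0),
        Project("C", date(2011, 6, 1), date(2012, 3, 1), 500.0),
    ]
    target_start = date(2012, 1, 1)
    # Project C is still running when the target starts, so it is never used.
    print([p.name for p in training_set(history, target_start)])
    print([p.name for p in training_set(history, target_start, window_days=365)])
```

A regression model would then be fitted to whichever training set is selected, and the estimates from the windowed and unwindowed models compared on absolute residuals and MREs.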
