Software Cost Estimation with Incomplete Data

The construction of software cost estimation models remains an active topic of research. The basic premise of cost modelling is that a historical database of software project cost data can be used to develop a quantitative model that predicts the cost of future projects. One difficulty faced by researchers in this area is that many of these historical databases contain substantial amounts of missing data. Thus far, the common practice has been to ignore observations with missing data. In principle, such a practice can lead to gross biases and may be detrimental to the accuracy of cost estimation models. In this paper we describe an extensive simulation in which we evaluate different techniques for dealing with missing data in the context of software cost modelling. Three techniques are evaluated: listwise deletion, mean imputation, and hot-deck imputation, the last in eight different variants. Our results indicate that all the missing data techniques perform well, with small biases and high precision. This suggests that the simplest technique, listwise deletion, is a reasonable choice. However, it will not necessarily provide the best performance. Consistently best performance (minimal bias and highest precision) can be obtained by using hot-deck imputation with Euclidean distance and z-score standardisation.
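To make the three techniques concrete, the following is a minimal Python sketch of how they might be applied to a small project table. It is an illustration only, not the authors' implementation. The function names, the use of `None` as the missing-value marker, the choice of complete rows as the donor pool, and the sample data are all assumptions introduced here; the abstract does not specify these details, nor which of the eight hot-deck variants differ in which way.

```python
import math
from statistics import mean, pstdev


def is_complete(row):
    """True when the row has no missing (None) values."""
    return all(v is not None for v in row)


def listwise_deletion(rows):
    """Drop every observation that has at least one missing value."""
    return [list(r) for r in rows if is_complete(r)]


def mean_imputation(rows):
    """Replace each missing value with the mean of the observed values in its column."""
    ncols = len(rows[0])
    col_means = [mean(r[j] for r in rows if r[j] is not None) for j in range(ncols)]
    return [[col_means[j] if v is None else v for j, v in enumerate(r)] for r in rows]


def hot_deck_imputation(rows):
    """Fill missing values from the most similar complete row (the donor).

    Similarity is Euclidean distance over the columns the incomplete row
    has observed, after z-score standardisation of each column.
    Assumption: donors are the complete rows; at least one must exist.
    """
    ncols = len(rows[0])
    observed = [[r[j] for r in rows if r[j] is not None] for j in range(ncols)]
    mu = [mean(col) for col in observed]
    # Fall back to 1.0 when a column has zero spread, to avoid dividing by zero.
    sd = [pstdev(col) or 1.0 for col in observed]
    donors = listwise_deletion(rows)

    def z(value, j):
        return (value - mu[j]) / sd[j]

    def distance(row, donor):
        shared = [j for j, v in enumerate(row) if v is not None]
        return math.sqrt(sum((z(row[j], j) - z(donor[j], j)) ** 2 for j in shared))

    filled = []
    for row in rows:
        if is_complete(row):
            filled.append(list(row))
            continue
        donor = min(donors, key=lambda d: distance(row, d))
        filled.append([donor[j] if v is None else v for j, v in enumerate(row)])
    return filled


# Hypothetical data: columns are size, team experience, effort.
projects = [
    [100, 10, 5],
    [200, 20, 9],
    [110, None, 6],
    [None, 19, 8],
]
print(listwise_deletion(projects))
print(mean_imputation(projects))
print(hot_deck_imputation(projects))
```

Listwise deletion shrinks the sample, mean imputation keeps every row but pulls imputed values toward the column centre, and hot-deck imputation borrows a real observed value from a similar project. Standardising before computing distances keeps a column with a large numeric range from dominating the choice of donor.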
