Dealing with missing software project data

Whilst there is a general consensus that quantitative approaches are an important part of successful software project management, there has been relatively little research into many of the obstacles to data collection and analysis in the real world. One feature that characterises many of the data sets we deal with is missing or highly questionable values. This problem is not unique to software engineering, so we explore the application of two existing data imputation techniques that have been used to good effect elsewhere. To assess the potential value of imputation we use two industrial data sets. Both are problematic from an effort modelling perspective because they contain few cases, have a significant number of missing values and describe quite heterogeneous projects. We compare the quality of fit of effort models derived by stepwise regression on the raw data with that of models derived from data sets in which the missing values have been imputed by various techniques. In both data sets we find that k-nearest neighbour (k-NN) and sample mean imputation (SMI) significantly improve the model fit, with k-NN giving the best results. These results are consistent with other recently published results; consequently, we conclude that imputation can assist empirical software engineering.
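
To make the two techniques concrete, the following is a minimal sketch of SMI and k-NN imputation, not the procedure used in the study. The function names, the toy data, the choice of a root-mean-square distance over jointly observed features, and the fallback to the column mean are all assumptions made for illustration; the abstract does not specify these details.

```python
import numpy as np

def mean_impute(X):
    """Sample mean imputation: replace each missing value (NaN)
    with the mean of the observed values in the same column."""
    X = np.asarray(X, dtype=float)
    out = X.copy()
    col_means = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(out))
    out[rows, cols] = col_means[cols]
    return out

def knn_impute(X, k=2):
    """k-NN imputation: replace each missing value with the mean of that
    feature over the k most similar cases that have the feature observed.
    Distance (an assumption): root-mean-square difference over the
    features observed in both cases."""
    X = np.asarray(X, dtype=float)
    out = X.copy()
    missing = np.isnan(X)
    for i in np.where(missing.any(axis=1))[0]:
        for j in np.where(missing[i])[0]:
            candidates = []
            for r in range(X.shape[0]):
                if r == i or missing[r, j]:
                    continue  # a donor must have feature j observed
                shared = ~missing[i] & ~missing[r]
                if not shared.any():
                    continue  # no common features, so no distance
                d = np.sqrt(np.mean((X[i, shared] - X[r, shared]) ** 2))
                candidates.append((d, X[r, j]))
            if candidates:
                candidates.sort(key=lambda t: t[0])
                out[i, j] = np.mean([v for _, v in candidates[:k]])
            else:
                # Assumed fallback: column mean when no donor exists.
                out[i, j] = np.nanmean(X[:, j])
    return out

# Hypothetical toy data: columns are (size, team size); NaN marks a missing value.
data = np.array([
    [100.0, 5.0],
    [110.0, np.nan],
    [300.0, 20.0],
    [320.0, 22.0],
    [105.0, 6.0],
])

smi_value = mean_impute(data)[1, 1]       # mean of 5, 20, 22, 6 = 13.25
knn_value = knn_impute(data, k=2)[1, 1]   # mean of the two closest cases = 5.5
print(smi_value, knn_value)
```

In this toy case the project with the missing value resembles the two small projects, so k-NN fills in a value close to theirs, whereas SMI is pulled towards the large projects. In practice the features would normally be standardised before computing distances, which this sketch omits.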
