A new imputation method for small software project data sets

Effort prediction is an important part of software project management, and historical project data sets are frequently used to support it. These data sets often contain missing data, which makes prediction more difficult. One common practice is to ignore the cases with missing data, but this makes an already small software project database even smaller and can further decrease prediction accuracy. The alternative is missing data imputation. Many imputation methods exist, but software data sets are typically small, and sophisticated imputation methods tend to need larger data sets. For this reason we explore using simple methods to impute missing data in small project effort data sets. We propose MINI, a class mean imputation (CMI) method based on the k-NN hot deck imputation method, to impute both continuous and nominal missing data in small data sets. We use an incremental approach to increase the variance of the population. To evaluate MINI (with the k-NN and CMI methods as benchmarks) we use data sets of 50 and 100 cases sampled from a larger industrial data set, with 10%, 15%, 20% and 30% of the data missing. We also simulate the Missing Completely at Random (MCAR) and Missing at Random (MAR) missingness mechanisms. The results suggest that MINI outperforms both the CMI and k-NN methods. We conclude that this new imputation technique can be used to impute missing values in small data sets.
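To make the two benchmark ideas concrete, the following is a minimal Python sketch of MCAR simulation, class mean imputation and a k-NN style imputation for one continuous attribute. It is not the MINI method itself, whose details the abstract does not give. The field names (`type`, `size`, `effort`), the function names, the Euclidean distance, and the choice to average the k donors' values are all assumptions made for illustration.

```python
import math
import random
from statistics import mean


def inject_mcar(rows, column, fraction, seed=0):
    """Copy rows, setting `fraction` of `column` to None uniformly at random (MCAR)."""
    rng = random.Random(seed)
    n_missing = round(len(rows) * fraction)
    missing_idx = set(rng.sample(range(len(rows)), n_missing))
    return [
        {**row, column: None} if i in missing_idx else dict(row)
        for i, row in enumerate(rows)
    ]


def class_mean_impute(rows, target, class_key):
    """Fill missing `target` values with the mean observed in the same class."""
    observed = [r[target] for r in rows if r[target] is not None]
    overall = mean(observed)  # fallback if a class has no observed values
    by_class = {}
    for r in rows:
        if r[target] is not None:
            by_class.setdefault(r[class_key], []).append(r[target])
    result = []
    for r in rows:
        if r[target] is None:
            donors = by_class.get(r[class_key])
            r = {**r, target: mean(donors) if donors else overall}
        result.append(dict(r))
    return result


def knn_impute(rows, target, features, k=2):
    """Fill missing `target` values with the mean over the k nearest complete cases."""
    donors = [r for r in rows if r[target] is not None]

    def distance(a, b):
        # Euclidean distance over the (assumed fully observed) numeric features
        return math.dist([a[f] for f in features], [b[f] for f in features])

    result = []
    for r in rows:
        if r[target] is None:
            nearest = sorted(donors, key=lambda d: distance(r, d))[:k]
            r = {**r, target: mean(d[target] for d in nearest)}
        result.append(dict(r))
    return result


# Hypothetical toy data: two project classes, one missing effort value in each.
projects = [
    {"type": "web", "size": 10, "effort": 100},
    {"type": "web", "size": 12, "effort": 120},
    {"type": "web", "size": 11, "effort": None},
    {"type": "embedded", "size": 40, "effort": 400},
    {"type": "embedded", "size": 42, "effort": 440},
    {"type": "embedded", "size": 41, "effort": None},
]

by_class_mean = class_mean_impute(projects, "effort", "type")
by_knn = knn_impute(projects, "effort", ["size"], k=2)
```

On this toy data both approaches fill each gap from projects of the same kind. On real data they can differ, because k-NN picks donors by feature similarity while class mean imputation uses a predefined grouping. A MAR simulation would instead make the probability of deletion depend on another observed attribute, which this sketch does not cover.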
