Multiple Logistic Regression as Imputation Method Applied on Software Error Prediction

A common problem in software cost estimation is the handling of incomplete or missing data in the databases used to develop prediction models. The simplest and most popular way to handle missing data is to ignore either the projects or the attributes that have missing observations. This approach discards valuable information and may therefore lead to inaccurate cost estimation models. Alternatively, various imputation methods can be used to estimate the missing values in a data set. These methods are applied mainly to numerical data and produce continuous estimates. However, it is well known that most cost data sets consist of software projects described mostly by categorical attributes, many of which have missing values. It is therefore reasonable to use an estimation method that produces categorical rather than continuous values. The purpose of this paper is to investigate the use of such a method for estimating missing categorical values in software cost databases. Specifically, Multinomial Logistic Regression (MLR) is proposed for imputation and is applied to projects from the ISBSG multi-organizational software database. Comparisons of MLR with other missing data techniques, such as listwise deletion (LD), mean imputation (MI), expectation maximization (EM) and regression imputation (RI), show that the proposed method is efficient, especially when the percentage of missing values is high.
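To make the idea of categorical imputation concrete, the following is a minimal sketch of MLR-based imputation, not the procedure evaluated in the paper. It fits a multinomial (softmax) logistic regression by plain gradient descent on the projects whose categorical attribute is observed, then predicts the category for the projects where it is missing. The attribute names (`platform`, `size`, `team`), the category labels, the toy values, and the helper names (`fit_mlr`, `impute_categorical`) are all hypothetical and chosen only for illustration; they are not taken from the ISBSG data.

```python
import math

def softmax(scores):
    # Subtract the maximum score for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fit_mlr(X, y, n_classes, lr=0.5, epochs=300):
    """Fit multinomial logistic regression by batch gradient descent.

    Returns one weight vector per class; the last entry is the bias.
    """
    n_features = len(X[0])
    W = [[0.0] * (n_features + 1) for _ in range(n_classes)]
    for _ in range(epochs):
        grad = [[0.0] * (n_features + 1) for _ in range(n_classes)]
        for xi, yi in zip(X, y):
            xb = xi + [1.0]  # append constant term for the bias
            probs = softmax([sum(w * v for w, v in zip(Wk, xb)) for Wk in W])
            for k in range(n_classes):
                err = probs[k] - (1.0 if k == yi else 0.0)
                for j, v in enumerate(xb):
                    grad[k][j] += err * v
        for k in range(n_classes):
            for j in range(n_features + 1):
                W[k][j] -= lr * grad[k][j] / len(X)
    return W

def predict(W, x):
    # Return the index of the class with the highest linear score.
    xb = x + [1.0]
    scores = [sum(w * v for w, v in zip(Wk, xb)) for Wk in W]
    return max(range(len(W)), key=lambda k: scores[k])

def impute_categorical(rows, target, predictors, categories):
    """Fill None values of `target` using MLR fitted on the complete rows."""
    complete = [r for r in rows if r[target] is not None]
    X = [[float(r[p]) for p in predictors] for r in complete]
    y = [categories.index(r[target]) for r in complete]
    W = fit_mlr(X, y, len(categories))
    filled = []
    for r in rows:
        r = dict(r)  # copy so the input rows are left unchanged
        if r[target] is None:
            x = [float(r[p]) for p in predictors]
            r[target] = categories[predict(W, x)]
        filled.append(r)
    return filled

# Hypothetical toy projects with standardized numeric predictors.
projects = [
    {"size": -1.0, "team": 0.1, "platform": "PC"},
    {"size": -0.9, "team": -0.1, "platform": "PC"},
    {"size": -1.1, "team": 0.0, "platform": "PC"},
    {"size": 0.0, "team": 1.0, "platform": "MR"},
    {"size": 0.1, "team": 0.9, "platform": "MR"},
    {"size": -0.1, "team": 1.1, "platform": "MR"},
    {"size": 1.0, "team": 0.0, "platform": "MF"},
    {"size": 0.9, "team": 0.1, "platform": "MF"},
    {"size": 1.1, "team": -0.1, "platform": "MF"},
    {"size": -1.0, "team": 0.0, "platform": None},  # missing value
    {"size": 0.0, "team": 1.0, "platform": None},   # missing value
    {"size": 1.0, "team": 0.0, "platform": None},   # missing value
]

filled = impute_categorical(projects, "platform", ["size", "team"],
                            ["PC", "MR", "MF"])
for row in filled[-3:]:
    print(row)
```

Unlike mean or regression imputation, the imputed value here is always one of the observed categories, which is the property the abstract argues for. A real application would also need to encode categorical predictors and handle missing values in the predictors themselves, both of which this sketch leaves out.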
