An Empirical Study of Dynamic Incomplete-Case Nearest Neighbor Imputation in Software Quality Data

Software quality prediction is an important yet difficult problem in software project development and management. Historical datasets can be used to build models for software quality prediction. However, missing data significantly degrades the predictive ability of models built through knowledge discovery. Instead of ignoring missing observations, we investigate and improve incomplete-case k-nearest neighbor imputation. K-nearest neighbor imputation is widely applied, but it has rarely been refined to select the most appropriate parameter settings for each imputation. This work performs imputation on four well-known software quality datasets to assess the impact of the new imputation method we propose. We compare it with mean imputation and with other commonly used versions of k-nearest neighbor imputation. The empirical results show that the proposed dynamic incomplete-case nearest neighbor imputation performs better when the missingness is completely at random or non-ignorable, regardless of the percentage of missing values.
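
For readers unfamiliar with the baseline technique, the following is a minimal sketch of incomplete-case k-nearest neighbor imputation with a fixed k. It is an illustration under stated assumptions, not the method evaluated in this study: the function name `icknn_impute`, the use of `None` to mark missing values, the root-mean-square distance over shared attributes, and mean aggregation of donor values are all choices made for this example. The dynamic, per-imputation parameter selection proposed in the paper is not reproduced here.

```python
import math


def icknn_impute(rows, k=3):
    """Fill None entries with a fixed-k incomplete-case nearest neighbor rule.

    Hypothetical helper for illustration only. A donor row may itself be
    incomplete, provided it has an observed value for the attribute being
    imputed and shares at least one other observed attribute with the target.
    """
    imputed = [list(r) for r in rows]
    for i, target in enumerate(rows):
        for j, value in enumerate(target):
            if value is not None:
                continue
            candidates = []
            for m, donor in enumerate(rows):
                # A donor must differ from the target and observe attribute j.
                if m == i or donor[j] is None:
                    continue
                # Attributes observed in both rows, excluding the one imputed.
                shared = [c for c in range(len(target))
                          if c != j
                          and target[c] is not None
                          and donor[c] is not None]
                if not shared:
                    continue
                # Assumption: normalize by the number of shared attributes so
                # donors with fewer overlapping attributes are not favored.
                dist = math.sqrt(
                    sum((target[c] - donor[c]) ** 2 for c in shared)
                    / len(shared))
                candidates.append((dist, donor[j]))
            if candidates:
                candidates.sort(key=lambda pair: pair[0])
                nearest = candidates[:k]
                imputed[i][j] = sum(v for _, v in nearest) / len(nearest)
    return imputed


data = [
    [1.0, 2.0, None],
    [1.0, 2.0, 10.0],
    [1.0, None, 20.0],   # incomplete, but still usable as a donor
    [9.0, 9.0, 99.0],
]
result = icknn_impute(data, k=2)
# result[0][2] is 15.0: the mean of the two nearest donor values, 10.0 and 20.0
```

In this sketch distances are always computed on the original observed values, never on previously imputed ones. In practice, attributes would usually be scaled to a common range before distances are computed; that step is omitted for brevity.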
