An evaluation of k-nearest neighbour imputation using Likert data

Research in many fields suffers from the problem of missing data. With missing data, statistical tests lose power, results may be biased, or analysis may not be feasible at all. There are several ways to handle the problem, one of which is imputation: missing values are replaced with estimates produced by an imputation method or model. In the k-nearest neighbour (k-NN) method, a case is imputed using values from the k most similar cases. In this paper, we evaluate the k-NN method using Likert data in a software engineering context. We simulate the method with different values of k and different percentages of missing data. Our findings indicate that it is feasible to use the k-NN method with Likert data. We suggest that a suitable value of k is approximately the square root of the number of complete cases. We also show that by relaxing the rules for selecting neighbours, the method's ability to impute remains high for large amounts of missing data without affecting the quality of the imputation.
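
To make the idea concrete, the following is a minimal sketch of k-NN imputation for Likert responses, not the implementation evaluated in the paper. It assumes that missing answers are encoded as None, that similarity is Euclidean distance over the items the incomplete case did answer, that neighbours are drawn only from complete cases (the strict rule, not the relaxed one), and that a missing answer is replaced by the lower median of the neighbours' answers so the result is always a valid scale value. The function name knn_impute and all of these choices are illustrative assumptions. When k is not given, it defaults to the rounded square root of the number of complete cases, following the suggestion above.

```python
import math
from statistics import median_low


def knn_impute(cases, k=None):
    """Return a copy of `cases` with None entries filled in.

    Assumptions (illustrative, not taken from the paper): neighbours
    are complete cases only, distance is Euclidean over the answered
    items, and the replacement value is the neighbours' lower median.
    """
    complete = [c for c in cases if None not in c]
    if not complete:
        raise ValueError("need at least one complete case")
    if k is None:
        # Rule of thumb: k is about sqrt(number of complete cases).
        k = max(1, round(math.sqrt(len(complete))))
    k = min(k, len(complete))

    imputed = []
    for case in cases:
        if None not in case:
            imputed.append(list(case))
            continue
        observed = [i for i, v in enumerate(case) if v is not None]

        def dist(other):
            # Compare only on the items this case actually answered.
            return math.sqrt(sum((case[i] - other[i]) ** 2 for i in observed))

        neighbours = sorted(complete, key=dist)[:k]
        imputed.append([
            v if v is not None else median_low([n[i] for n in neighbours])
            for i, v in enumerate(case)
        ])
    return imputed


# Four complete cases, so k defaults to 2.
responses = [
    [1, 1, 1],
    [1, 2, 1],
    [5, 5, 5],
    [5, 4, 5],
    [1, 1, None],
    [5, None, 5],
]
result = knn_impute(responses)
```

Using the lower median rather than the mean is a design choice made here because Likert answers are ordinal and the imputed value should be one of the scale points; other replacement rules are possible.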
