A Comparison of Six Methods for Missing Data Imputation

Missing data occur in almost all research and introduce ambiguity into data analysis. They must therefore be handled appropriately to obtain an efficient and valid analysis. In the present study, we compare six imputation methods: mean imputation, K-nearest neighbors (KNN), fuzzy K-means (FKM), singular value decomposition (SVD), Bayesian principal component analysis (bPCA), and multiple imputation by chained equations (MICE). The comparison was performed on four real datasets of various sizes (from 4 to 65 variables), under a missing completely at random (MCAR) assumption, and based on four evaluation criteria: root mean squared error (RMSE), unsupervised classification error (UCE), supervised classification error (SCE), and execution time. Our results suggest that bPCA and FKM are two imputation methods of interest that deserve further consideration in practice.
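
The evaluation protocol described above (remove values under MCAR, impute, then score against the known truth) can be illustrated with a minimal sketch. This is not the study's code: the synthetic data, the 10% missingness rate, the function names, and the choice to compute RMSE only over the removed entries are all assumptions made for illustration. Only mean imputation, the simplest of the six methods, is shown.

```python
import numpy as np


def mcar_mask(shape, rate, rng):
    """Return a boolean mask that is True where a value is removed.

    Each cell is dropped independently with probability `rate`,
    which is what "missing completely at random" means here.
    """
    return rng.random(shape) < rate


def mean_impute(x_missing):
    """Replace each NaN with the observed mean of its column."""
    col_means = np.nanmean(x_missing, axis=0)
    return np.where(np.isnan(x_missing), col_means, x_missing)


def rmse_on_missing(x_true, x_imputed, mask):
    """RMSE between true and imputed values, over removed entries only (assumed convention)."""
    diff = x_true[mask] - x_imputed[mask]
    return float(np.sqrt(np.mean(diff ** 2)))


rng = np.random.default_rng(0)
x_true = rng.normal(size=(150, 4))      # synthetic stand-in for a complete dataset
mask = mcar_mask(x_true.shape, 0.10, rng)  # hypothetical 10% missingness rate
x_missing = np.where(mask, np.nan, x_true)

x_imputed = mean_impute(x_missing)
score = rmse_on_missing(x_true, x_imputed, mask)
print(f"RMSE of mean imputation on removed entries: {score:.3f}")
```

Any other imputation method could be swapped in for `mean_impute` and scored with the same mask, which keeps the comparison between methods on identical missing entries.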
