Predicting feature imputability in the absence of ground truth

Data imputation is the most popular method of dealing with missing values. In most real-life applications, however, large amounts of data can be missing, and without ground truth it is difficult or impossible to evaluate whether the data have been imputed accurately. This paper addresses these issues by proposing a simple and effective principal-component-based method for determining feature imputability, that is, whether an individual data feature can be accurately imputed. In particular, we establish a strong linear relationship between principal component loadings and feature imputability, even in the presence of extreme missingness and a lack of ground truth. This work has important implications for practical data imputation strategies.
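To make the idea of relating principal component loadings to individual features concrete, the following is a minimal sketch and not the paper's actual procedure. It assumes that missing entries are mean-filled before the decomposition, that only the first component is kept, and that the data are synthetic; the function name `pc_loading_scores` and all parameters are hypothetical. It computes a per-feature loading magnitude from data containing missing values, which could then be compared against imputation error wherever ground truth happens to be available.

```python
import numpy as np

def pc_loading_scores(X, n_components=1):
    """Per-feature loading magnitude on the leading principal components.

    Assumption (illustrative only): NaNs are mean-filled after
    standardizing each feature on its observed values.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    mean = np.nanmean(X, axis=0)
    std = np.nanstd(X, axis=0)
    std[std == 0] = 1.0                      # guard against constant features
    Z = (X - mean) / std
    Z = np.where(np.isnan(Z), 0.0, Z)        # mean-fill in standardized units
    # Rows of Vt are principal axes; s holds the singular values.
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)
    k = n_components
    # Loadings = principal axis scaled by the component's standard deviation.
    loadings = Vt[:k].T * (s[:k] / np.sqrt(n - 1))
    return np.sqrt((loadings ** 2).sum(axis=1))

# Synthetic example: three features share a latent factor, one is pure noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=500)
X = np.column_stack([
    latent + 0.2 * rng.normal(size=500),
    latent + 0.2 * rng.normal(size=500),
    latent + 0.2 * rng.normal(size=500),
    rng.normal(size=500),                    # unrelated to the other features
])
X[rng.random(X.shape) < 0.3] = np.nan        # 30% missing completely at random

scores = pc_loading_scores(X)
print(scores)
```

In this toy setting, the three correlated features receive larger loading magnitudes than the independent noise feature, which matches the intuition that features sharing structure with the rest of the data are the ones most likely to be recoverable by imputation.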
