How efficient is estimation with missing data?

In this paper, we present a new evaluation approach for missing data techniques (MDTs) where the efficiency of those are investigated using listwise deletion method as reference. We experiment on classification problems and calculate misclassification rates (MR) for different missing data percentages (MDP) using a missing completely at random (MCAR) scheme. We compare three MDTs: pairwise deletion (PW), mean imputation (MI) and a maximum likelihood method that we call complete expectation maximization (CEM). We use a synthetic dataset, the Iris dataset and the Pima Indians Diabetes dataset. We train a Gaussian mixture model (GMM). We test the trained GMM for two cases, in which test dataset is missing or complete. The results show that CEM is the most efficient method in both cases while MI is the worst performer of the three. PW and CEM proves to be more stable, in particular for higher MDP values than MI.

[1]  Graham K. Rand,et al.  Quantitative Applications in the Social Sciences , 1983 .

[2]  J. Schafer,et al.  Missing data: our view of the state of the art. , 2002, Psychological methods.

[3]  Ingunn Myrtveit,et al.  Analyzing Data Sets with Missing Data: An Empirical Evaluation of Imputation Methods and Likelihood-Based Methods , 2001, IEEE Trans. Software Eng..

[4]  T. Stijnen,et al.  Review: a gentle introduction to imputation of missing values. , 2006, Journal of clinical epidemiology.

[5]  Lars Kai Hansen,et al.  Modeling text with generalizable Gaussian mixtures , 2000, 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100).

[6]  T. Schneider Analysis of Incomplete Climate Data: Estimation of Mean Values and Covariance Matrices and Imputation of Missing Values. , 2001 .

[7]  SuTing,et al.  In search of deterministic methods for initializing K-means and Gaussian mixture clustering , 2007 .

[8]  P. Roth MISSING DATA: A CONCEPTUAL REVIEW FOR APPLIED PSYCHOLOGISTS , 1994 .

[9]  D. Rubin,et al.  Multiple Imputation for Nonresponse in Surveys , 1989 .

[10]  D. Rubin Multiple imputation for nonresponse in surveys , 1989 .

[11]  Daniel A. Newman Longitudinal Modeling with Randomly and Systematically Missing Data: A Simulation of Ad Hoc, Maximum Likelihood, and Multiple Imputation Techniques , 2003 .

[12]  Michael I. Jordan,et al.  Supervised learning from incomplete data via an EM approach , 1993, NIPS.

[13]  Lars Kai Hansen,et al.  Probabilistic Hierarchical Clustering with Labeled and Unlabeled Data , 2001 .

[14]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM - algorithm plus discussions on the paper , 1977 .

[15]  Ting Su,et al.  In search of deterministic methods for initializing K-means and Gaussian mixture clustering , 2007, Intell. Data Anal..

[16]  D. Rubin INFERENCE AND MISSING DATA , 1975 .