Missing Data Imputation Techniques

Intelligent data analysis techniques are useful for exploring real-world data sets. However, real-world data sets are always accompanied by missing data, which is one major factor affecting data quality, and good intelligent data exploration requires quality data. Fortunately, Missing Data Imputation Techniques (MDITs) can be used to improve data quality. No single MDIT can be used in all conditions, however; each method has its own context. In this paper, we introduce MDITs to the KDD and machine learning communities by presenting the basic idea of each method and highlighting its advantages and limitations.
