Information-decomposition-model-based missing value estimation for not missing at random dataset

Missing data estimation is an important strategy for improving learning performance when learning from incomplete data, especially when records with missing values cannot be discarded. However, most existing algorithms focus on data missing at random (MAR) or missing completely at random (MCAR), and less attention has been paid to data not missing at random (NMAR). In this paper, an information decomposition imputation (IDIM) algorithm using a fuzzy membership function is proposed to address the missing value problem under NMAR. First, the proposed IDIM algorithm is presented with detailed examples. Then, the approach is evaluated in extensive experiments against several typical algorithms. The experimental results demonstrate that the proposed algorithm achieves higher accuracy than the existing imputation approaches in terms of normalized root mean square error (NRMSE) and TP+TN evaluation under different missing-data strategies.
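
To make the NMAR setting and the NRMSE criterion concrete, the following is a minimal sketch and not the paper's IDIM algorithm. It assumes one common NRMSE convention (RMSE over the imputed entries divided by the population standard deviation of the true values); the abstract does not state the exact normalization. The helper names `mask_nmar` and `nrmse`, and the threshold rule for hiding values, are hypothetical choices made for illustration.

```python
import math


def mask_nmar(values, threshold):
    """Hide every value above `threshold` (hypothetical NMAR rule).

    Missingness depends on the unobserved value itself, which is
    what distinguishes NMAR from MAR and MCAR.
    """
    return [None if v > threshold else v for v in values]


def nrmse(true_values, imputed_values):
    """RMSE divided by the population std of the true values.

    Assumed normalization; other papers divide by the range or mean.
    """
    n = len(true_values)
    if n == 0 or n != len(imputed_values):
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((t - p) ** 2 for t, p in zip(true_values, imputed_values)) / n
    mean_true = sum(true_values) / n
    var_true = sum((t - mean_true) ** 2 for t in true_values) / n
    if var_true == 0:
        raise ValueError("true values have zero variance")
    return math.sqrt(mse / var_true)


# Toy column: the largest values go missing, then are mean-imputed
# from the observed entries (a baseline, not IDIM).
column = [1.0, 2.0, 3.0, 8.0, 9.0]
masked = mask_nmar(column, threshold=5.0)
observed = [v for v in masked if v is not None]
fill = sum(observed) / len(observed)

# Score only the entries that were actually hidden.
hidden_true = [t for t, m in zip(column, masked) if m is None]
hidden_imputed = [fill] * len(hidden_true)
score = nrmse(hidden_true, hidden_imputed)
```

Because the hidden values are systematically larger than the observed ones, the observed mean underestimates them badly. This bias is the reason MAR- and MCAR-oriented imputers tend to degrade under NMAR.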
