Missing data analyses: a hybrid multiple imputation algorithm using Gray System Theory and entropy based on clustering

Researchers and practitioners who work with databases often find knowledge discovery and application development cumbersome because of missing data. Although some approaches tolerate a certain rate of incomplete data, most demand complete, high-quality data. Many strategies have therefore been designed to handle missingness, particularly through imputation. Single imputation methods initially succeeded in predicting missing values for specific types of distributions. Multiple imputation algorithms have nevertheless remained prevalent, because they further improve validity by iteratively minimizing bias and require less prior knowledge of the distributions.

This article reviews the state of the art and proposes a hybrid missing data completion method named Multiple Imputation using Gray-system-theory and Entropy based on Clustering (MIGEC). First, the non-missing data instances are separated into several clusters. Then, for each incomplete instance, the imputed value is obtained after multiple calculations that use the information entropy of the proximal category, where proximity is measured by a similarity metric based on Gray System Theory (GST).

Experimental results on University of California Irvine (UCI) datasets show that MIGEC is more accurate than other current methods for both numeric and categorical attributes under different missing mechanisms. Further discussion of real aerospace datasets indicates that MIGEC is also applicable to this specific area, generally with more precise inference and faster convergence than other multiple imputation methods.
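The cluster-then-impute idea can be illustrated with a minimal sketch. It is not the MIGEC algorithm itself: it assumes the clusters of complete instances are already given, uses Deng's standard grey relational grade as the similarity metric, and fills each missing numeric attribute with the mean of the most similar cluster. The entropy weighting and the repeated (multiple) imputation passes described in the abstract are not reproduced here. The function names `grey_relational_grades` and `impute_from_nearest_cluster` are hypothetical, chosen only for this illustration.

```python
from statistics import mean


def grey_relational_grades(reference, candidates, zeta=0.5):
    """Grey relational grade of each candidate with respect to the reference.

    reference  : list of floats (observed attributes only)
    candidates : list of lists, each as long as `reference`
    zeta       : distinguishing coefficient, conventionally 0.5
    """
    # Absolute differences between the reference and every candidate.
    deltas = [[abs(r - c) for r, c in zip(reference, cand)]
              for cand in candidates]
    flat = [d for row in deltas for d in row]
    d_min, d_max = min(flat), max(flat)
    if d_max == 0:  # every candidate is identical to the reference
        return [1.0] * len(candidates)
    # Grade = mean of the grey relational coefficients over all attributes.
    return [mean((d_min + zeta * d_max) / (d + zeta * d_max) for d in row)
            for row in deltas]


def impute_from_nearest_cluster(incomplete, clusters, zeta=0.5):
    """Fill None entries from the cluster with the highest grey relational grade.

    incomplete : list with None marking missing numeric attributes
    clusters   : list of clusters, each a list of complete instances
                 (assumed to come from an earlier clustering step)
    Returns (filled_instance, index_of_chosen_cluster).
    """
    n_attrs = len(incomplete)
    observed = [k for k, v in enumerate(incomplete) if v is not None]
    # Represent each cluster by its attribute-wise mean (centroid).
    centroids = [[mean(row[k] for row in members) for k in range(n_attrs)]
                 for members in clusters]
    grades = grey_relational_grades(
        [incomplete[k] for k in observed],
        [[c[k] for k in observed] for c in centroids],
        zeta,
    )
    best = max(range(len(clusters)), key=grades.__getitem__)
    filled = [centroids[best][k] if v is None else v
              for k, v in enumerate(incomplete)]
    return filled, best


# Illustrative data only: two small clusters and one incomplete instance.
clusters = [
    [[1.0, 2.0, 3.0], [1.2, 2.2, 3.4]],
    [[8.0, 9.0, 10.0], [8.4, 9.2, 10.2]],
]
filled, chosen = impute_from_nearest_cluster([1.1, None, 3.1], clusters)
print(chosen, filled)
```

In this toy run the incomplete instance is matched to the first cluster, and its missing second attribute is filled with that cluster's mean. A fuller treatment would replace the plain cluster mean with an entropy-informed estimate and repeat the imputation several times, as the abstract describes.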
