Imputing missing value through ensemble concept based on statistical measures

Many datasets contain missing attribute values, and data mining techniques generally cannot be applied in the presence of missing values. Missing value management is therefore an important preprocessing step in a data mining task, and missing value imputation is one of its most important categories. This paper presents a new imputation technique based on statistical measures. The technique employs an ensemble of estimators, built separately from the observed attributes that are positively correlated and those that are negatively correlated with the attribute being imputed. Each estimator guesses a value for a missing entry based on the mean and variance of that feature, both estimated from the feature's non-missing values. The final consensus value for a missing entry is a weighted aggregation of the values produced by the different estimators. The primary weight is the attribute correlation; a secondary, smaller weight depends on kernel functions such as kurtosis, skewness, the number of samples involved, and their composition. For evaluation, missing values are deliberately introduced at random at different levels. The experiments indicate that the proposed technique achieves good accuracy in comparison with classical methods.
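The abstract does not give the estimator formula, so the following Python sketch is only one plausible reading of the idea, not the authors' method. It assumes that each observed attribute k acts as one estimator for a missing entry in attribute j: it transfers the row's z-score on k to the scale of j (mean and standard deviation taken from the non-missing values of j), flipping the sign when the two attributes are negatively correlated. The consensus is the average of these estimates weighted by the absolute correlation. The secondary weights (kurtosis, skewness, sample count) are omitted, and the function name `impute_correlation_ensemble` is hypothetical.

```python
import numpy as np

def impute_correlation_ensemble(X):
    """Fill NaNs with a correlation-weighted ensemble of per-attribute estimates.

    Illustrative assumption: estimator k predicts
        mean_j + sign(r_jk) * std_j * z_k
    and is weighted by |r_jk| in the final aggregation.
    """
    X = np.asarray(X, dtype=float)
    n_cols = X.shape[1]
    observed = ~np.isnan(X)

    # Per-feature statistics estimated from non-missing values only.
    mean = np.nanmean(X, axis=0)
    std = np.nanstd(X, axis=0)

    # Pairwise Pearson correlations over rows where both features are observed.
    corr = np.zeros((n_cols, n_cols))
    for j in range(n_cols):
        for k in range(n_cols):
            both = observed[:, j] & observed[:, k]
            if j == k or both.sum() < 3:
                continue
            a, b = X[both, j], X[both, k]
            if a.std() > 0 and b.std() > 0:
                corr[j, k] = np.corrcoef(a, b)[0, 1]

    filled = X.copy()
    for i, j in zip(*np.where(~observed)):
        weighted_sum = 0.0
        weight_total = 0.0
        for k in range(n_cols):
            r = corr[j, k]
            if k == j or not observed[i, k] or r == 0 or std[k] == 0:
                continue
            # z-score of this row on the helper attribute k
            z = (X[i, k] - mean[k]) / std[k]
            # Positive and negative correlations are handled by the sign flip.
            estimate = mean[j] + np.sign(r) * std[j] * z
            weighted_sum += abs(r) * estimate
            weight_total += abs(r)
        # Fall back to the feature mean when no estimator is available.
        filled[i, j] = weighted_sum / weight_total if weight_total > 0 else mean[j]
    return filled


# Small usage example: column 1 rises with column 0, column 2 falls with it.
data = np.array([
    [1.0, 2.0, -1.0],
    [2.0, 4.0, -2.0],
    [3.0, 6.0, -3.0],
    [4.0, 8.0, -4.0],
    [5.0, np.nan, -5.0],
])
completed = impute_correlation_ensemble(data)
```

In this toy example both helper attributes agree that the missing entry lies above the observed mean of its column, one through a positive and one through a negative correlation. Because the mean and variance of the target column are estimated from only its four observed values, the imputed value is pulled toward that observed range rather than extrapolating the exact linear trend.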
