Dealing with Missing Values

This chapter introduces the approaches used in the literature to handle Missing Values (MVs). Real-life data sets frequently lose information because some attribute values are missing, which hampers data mining tasks. Several schemes have been studied to overcome the drawbacks that MVs produce in these tasks; one of the best known is based on preprocessing and is formally known as imputation. After the introduction in Sect. 4.1, the chapter presents the theoretical background in Sect. 4.2, which analyzes the underlying distribution of the missingness. The subsequent sections then move from the simplest approaches in Sect. 4.3 to the most advanced proposals, focusing on the imputation of MVs. These advanced methods include the classic maximum likelihood procedures, such as Expectation-Maximization and Multiple Imputation (Sect. 4.4), and the more recent Machine Learning based approaches, which use classification or regression algorithms to perform the imputation (Sect. 4.5). Finally, a comparative experimental study is carried out in Sect. 4.6.
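
To make the idea of imputation as a preprocessing step concrete, the following is a minimal sketch of one of the simplest strategies, replacing each MV with the mean of the observed values of its attribute. It is an illustration only, not the chapter's own implementation: the function name `mean_impute`, the toy data, and the use of NaN to mark an MV are all assumptions made for this example.

```python
import math
from statistics import mean


def mean_impute(rows):
    """Return a copy of rows where each NaN is replaced by its column mean.

    Hypothetical helper for illustration: MVs are assumed to be encoded
    as float NaN, and all attributes are assumed to be numeric.
    """
    n_cols = len(rows[0])
    col_means = []
    for j in range(n_cols):
        # Only the observed (non-missing) values contribute to the mean.
        observed = [r[j] for r in rows if not math.isnan(r[j])]
        # A column with no observed values cannot be imputed; keep NaN.
        col_means.append(mean(observed) if observed else math.nan)
    return [
        [col_means[j] if math.isnan(v) else v for j, v in enumerate(r)]
        for r in rows
    ]


# Toy data set (assumed): two numeric attributes, two MVs.
data = [
    [1.0, 10.0],
    [2.0, math.nan],
    [math.nan, 30.0],
    [3.0, 20.0],
]
imputed = mean_impute(data)
```

Mean imputation ignores any relationship between attributes, which is one reason the more advanced maximum likelihood and Machine Learning based methods are of interest.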
