A robust missing value imputation method for noisy data

Missing data imputation is an important research topic in data mining. Previous work has seldom considered the impact of noise, even though real-world data are often noisy. In this paper, we systematically investigate the impact of noise on imputation methods and propose a new imputation approach that uses the mechanism of the Group Method of Data Handling (GMDH) to deal with incomplete data that contain noise. We compare our method, called RIBG (robust imputation based on GMDH), with four commonly used imputation methods on nine benchmark datasets. The experimental results demonstrate that noise has a great impact on the effectiveness of imputation techniques and that RIBG is more robust to noise than the four benchmark methods.
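To make the GMDH idea concrete, the following is a minimal sketch of a single GMDH-style selection step used for imputation. It is not the RIBG algorithm, whose details are not given in this abstract. The function names `quad_features` and `gmdh_layer_impute` are hypothetical, and the sketch assumes that only the target column has missing values and that there are at least two other, complete columns. Each candidate model is a quadratic (Ivakhnenko) polynomial in a pair of inputs, fitted on one half of the complete rows and scored on the other half, which plays the role of GMDH's external criterion.

```python
import itertools
import numpy as np

def quad_features(a, b):
    # Terms of the quadratic Ivakhnenko polynomial: 1, a, b, a*b, a^2, b^2
    return np.column_stack([np.ones_like(a), a, b, a * b, a ** 2, b ** 2])

def gmdh_layer_impute(X, target):
    """Fill NaNs in column `target` with the best pairwise quadratic model.

    Illustrative assumption: all other columns are complete.
    """
    X = np.asarray(X, dtype=float).copy()
    missing = np.isnan(X[:, target])
    known = np.where(~missing)[0]
    # Split complete rows: one part to fit coefficients, one part to check them.
    fit_idx, chk_idx = known[::2], known[1::2]
    inputs = [j for j in range(X.shape[1]) if j != target]
    best = None
    for i, j in itertools.combinations(inputs, 2):
        A_fit = quad_features(X[fit_idx, i], X[fit_idx, j])
        coef, *_ = np.linalg.lstsq(A_fit, X[fit_idx, target], rcond=None)
        A_chk = quad_features(X[chk_idx, i], X[chk_idx, j])
        # External criterion: squared error on rows not used for fitting.
        err = np.mean((A_chk @ coef - X[chk_idx, target]) ** 2)
        if best is None or err < best[0]:
            best = (err, i, j, coef)
    _, i, j, coef = best
    X[missing, target] = quad_features(X[missing, i], X[missing, j]) @ coef
    return X
```

Selecting the model on held-out rows rather than on the fitting rows is what is generally credited with making GMDH-type models less prone to fitting noise. A full GMDH network would stack several such layers and feed the best candidates forward, which this sketch omits.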
