A Novel Soft Computing Hybrid for Data Imputation

We propose a novel two-stage soft computing approach for data imputation that combines local learning and global approximation in tandem, whereas existing methods in the literature use only one of the two. In stage 1, the K-means algorithm is used to replace the missing values with the corresponding values of the cluster centers. Stage 2 refines the resulting approximate values using a multilayer perceptron (MLP). The MLP is trained on the complete data, taking each attribute that has missing values as the target variable, one at a time. The hybrid is tested on two benchmark problems each in classification and regression using 10-fold cross-validation. In all datasets, some values are removed at random and treated as missing. The actual and predicted values are compared using the Mean Absolute Percentage Error (MAPE). We observe that the MAPE value decreases from stage 1 to stage 2, indicating that the hybrid approach yields better imputation than stage 1 alone.
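The two stages and the MAPE comparison described above can be illustrated with a minimal NumPy-only sketch. It is not the authors' implementation. All function names (`kmeans`, `stage1_impute`, `train_mlp`, `stage2_refine`, `mape`) are hypothetical, as are the hyperparameters (5 clusters, 8 hidden units, learning rate, epochs). The sketch also assumes that an incomplete record is assigned to its nearest cluster center using only its observed attributes, that the MLP receives stage-1 values for any other missing inputs, and that the data are synthetic rather than the benchmark datasets.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's K-means on complete rows; returns the k cluster centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):          # skip clusters that became empty
                centers[j] = X[labels == j].mean(axis=0)
    return centers

def stage1_impute(X, centers):
    """Stage 1: fill each NaN with the matching coordinate of the nearest
    center, measuring distance on the observed attributes only (assumption)."""
    X_filled = X.copy()
    for i, row in enumerate(X):
        miss = np.isnan(row)
        if miss.any():
            d = ((centers[:, ~miss] - row[~miss]) ** 2).sum(axis=1)
            X_filled[i, miss] = centers[d.argmin(), miss]
    return X_filled

def train_mlp(X, y, hidden=8, lr=0.05, epochs=2000, seed=0):
    """One-hidden-layer tanh MLP trained by batch gradient descent on MSE."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0, 0.5, (X.shape[1], hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0, 0.5, hidden); b2 = 0.0
    for _ in range(epochs):
        H = np.tanh(X @ W1 + b1)
        err = H @ W2 + b2 - y
        gH = np.outer(err, W2) * (1 - H ** 2)   # gradient at hidden pre-activation
        W2 -= lr * H.T @ err / len(y); b2 -= lr * err.mean()
        W1 -= lr * X.T @ gH / len(y); b1 -= lr * gH.mean(axis=0)
    return lambda Z: np.tanh(Z @ W1 + b1) @ W2 + b2

def stage2_refine(X, X_stage1, complete):
    """Stage 2: for each attribute with missing values, train an MLP on the
    complete rows (that attribute as target) and overwrite the stage-1 guess."""
    X_out = X_stage1.copy()
    for j in np.where(np.isnan(X).any(axis=0))[0]:
        others = np.delete(np.arange(X.shape[1]), j)
        predict = train_mlp(complete[:, others], complete[:, j])
        rows = np.isnan(X[:, j])
        X_out[rows, j] = predict(X_stage1[rows][:, others])
    return X_out

def mape(actual, predicted):
    """Mean Absolute Percentage Error, in percent (actual values must be nonzero)."""
    return 100.0 * np.mean(np.abs((actual - predicted) / actual))

# Synthetic demonstration data (an assumption, not the paper's benchmarks).
rng = np.random.default_rng(0)
truth = rng.uniform(1.0, 2.0, size=(200, 3))
truth[:, 2] = 0.5 * truth[:, 0] + 0.5 * truth[:, 1]   # column 2 depends on 0 and 1
X = truth.copy()
mask = rng.random(X.shape) < 0.05                     # randomly remove about 5% of values
X[mask] = np.nan

complete = X[~np.isnan(X).any(axis=1)]
centers = kmeans(complete, k=5)
X1 = stage1_impute(X, centers)
X2 = stage2_refine(X, X1, complete)
print("stage 1 MAPE:", mape(truth[mask], X1[mask]))
print("stage 2 MAPE:", mape(truth[mask], X2[mask]))
```

Comparing the two printed values mirrors the stage 1 versus stage 2 comparison in the abstract, although whether stage 2 improves on stage 1 for a given dataset depends on the data and on the hyperparameters chosen here.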
