Data imputation via evolutionary computation, clustering and a neural network

In this paper, we propose two novel hybrid imputation methods that use particle swarm optimization (PSO), the evolving clustering method (ECM) and the autoassociative extreme learning machine (AAELM) in tandem, and that also preserve the covariance structure of the data. We removed the randomness of AAELM by invoking ECM between the input and hidden layers. We selected the optimal value of the ECM distance threshold, Dthr, using PSO, which simultaneously minimizes two error functions: (i) the mean squared error between the covariance matrix of the set of complete records and that of the set of total records, including imputed ones, and (ii) the absolute difference between the determinants of the two covariance matrices. The proposed methods outperformed many existing imputation methods on the majority of the datasets. We also performed statistical significance testing to confirm the credibility of the results. The superior performance of one of the hybrids is attributed to its combination of local learning, global optimization and global learning. Both methods resolved a persistent issue in ECM-based imputation: the difficult choice of the Dthr value and its dominant influence on the results. We conclude that the proposed methods are a viable alternative to existing ones for data imputation.
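The two covariance-based error functions described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `covariance_errors`, the use of NumPy, the sample covariance estimator and the synthetic data are all assumptions made for illustration, and the abstract does not say how PSO combines the two values, so the sketch simply returns both.

```python
import numpy as np


def covariance_errors(complete, total):
    """Return the two error terms named in the abstract.

    complete: 2-D array holding only the records without missing values.
    total:    2-D array holding all records after imputation.
    Both arrays need the same number of columns (at least two).
    The function name and signature are hypothetical.
    """
    # Rows are records and columns are attributes, hence rowvar=False.
    cov_complete = np.cov(complete, rowvar=False)
    cov_total = np.cov(total, rowvar=False)

    # (i) mean squared error between the two covariance matrices
    mse = float(np.mean((cov_complete - cov_total) ** 2))

    # (ii) absolute difference between their determinants
    det_diff = float(abs(np.linalg.det(cov_complete) - np.linalg.det(cov_total)))

    return mse, det_diff


# Synthetic data for illustration only; these are not the paper's datasets.
rng = np.random.default_rng(0)
complete = rng.normal(size=(50, 3))
imputed = rng.normal(size=(10, 3))      # stand-in for imputed records
total = np.vstack([complete, imputed])

mse, det_diff = covariance_errors(complete, total)
print(mse, det_diff)
```

In a PSO loop, each candidate Dthr would yield a different set of imputed records and therefore a different pair of error values; both are zero when the imputed data leave the covariance matrix unchanged.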
