Particle swarm optimization and covariance matrix based data imputation

We propose a data imputation method based on Particle Swarm Optimization (PSO) and the covariance matrix of the data. PSO is used to minimize the following two error functions in a nested form: (i) the mean squared error between the covariance matrix of the set of complete records and the covariance matrix of the set of total records, including the imputed ones; and (ii) the absolute difference between the determinants of the two covariance matrices. The algorithm is designed to stop only when both errors become very small across two consecutive iterations. The proposed method was tested on several regression, classification and banking datasets using 10-fold cross-validation. The quality of the imputation was measured by the Mean Absolute Percentage Error (MAPE). We compared the results of the proposed method with those of a hybrid data imputation method based on K-means and a Multi-Layer Perceptron (MLP). We observed that the proposed method preserves the covariance structure of the data and also achieves better imputation on most of the datasets; the statistical significance of the results was assessed with the Wilcoxon signed-rank test.
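The two error functions and the MAPE quality measure described above can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: the function names (`covariance_matrix`, `determinant`, `cov_mse`, `det_abs_diff`, `mape`) are hypothetical, the sample covariance with an n - 1 denominator is an assumption, and the nesting of the two errors inside the PSO loop is not shown because the abstract does not specify it.

```python
def covariance_matrix(rows):
    """Sample covariance matrix (n - 1 denominator, an assumption) of a list of records."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [
        [sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
         for j in range(d)]
        for i in range(d)
    ]


def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]  # work on a copy
    n, det = len(a), 1.0
    for k in range(n):
        pivot = max(range(k, n), key=lambda r: abs(a[r][k]))
        if a[pivot][k] == 0:
            return 0.0  # singular matrix
        if pivot != k:
            a[k], a[pivot] = a[pivot], a[k]
            det = -det  # a row swap flips the sign
        det *= a[k][k]
        for r in range(k + 1, n):
            factor = a[r][k] / a[k][k]
            for c in range(k, n):
                a[r][c] -= factor * a[k][c]
    return det


def cov_mse(complete, total):
    """Error (i): mean squared difference between the two covariance matrices."""
    c1, c2 = covariance_matrix(complete), covariance_matrix(total)
    d = len(c1)
    return sum((c1[i][j] - c2[i][j]) ** 2 for i in range(d) for j in range(d)) / (d * d)


def det_abs_diff(complete, total):
    """Error (ii): absolute difference between the two determinants."""
    return abs(determinant(covariance_matrix(complete))
               - determinant(covariance_matrix(total)))


def mape(actual, imputed):
    """Mean Absolute Percentage Error; actual values must be non-zero."""
    terms = [abs((a - p) / a) for a, p in zip(actual, imputed)]
    return 100.0 * sum(terms) / len(terms)
```

In such a sketch, `complete` would hold only the records without missing values, `total` would hold all records after a candidate imputation, and a PSO particle would encode the candidate values for the missing cells.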
