Probabilistic neural network based categorical data imputation

Real-world datasets contain both numerical and categorical attributes, and missing values often occur in both. Missing data should be imputed because inferences drawn from complete data are often more accurate and reliable than those drawn from incomplete data [15]. In addition, most data mining algorithms cannot work with incomplete datasets. This paper proposes a novel soft computing architecture for categorical data imputation. The proposed technique employs a Probabilistic Neural Network (PNN) preceded by mode imputation to fill in the missing categorical data. Its effectiveness is tested on 4 benchmark datasets under a 10-fold cross-validation framework. All of the datasets except Mushroom are complete, so in those datasets some values are randomly removed and treated as missing. The performance of the proposed technique is compared with that of 3 statistical and 3 machine learning methods for data imputation. The comparison of the mode+PNN technique with the mode, K-Nearest Neighbor (K-NN), Hot Deck (HD), Naive Bayes, Random Forest (RF) and J48 (Decision Tree) imputation techniques demonstrates that the proposed method is efficient, especially when the percentage of missing values is high, for records having more than one missing value, and for records whose categorical variables have a large number of categories.
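The two-stage idea in the abstract (a mode fill followed by a PNN) can be illustrated with a minimal Python sketch. The abstract does not give the architecture in detail, so everything here is an assumption made for illustration: the function names, the one-hot encoding of the inputs, the Gaussian kernel with smoothing parameter `sigma`, the equal class priors, and the toy data are all hypothetical. Each column with missing cells is treated in turn as the class label, the other columns (mode-filled) serve as inputs, and a Parzen-window PNN in the spirit of [28] predicts the missing labels.

```python
from collections import Counter

import numpy as np

MISSING = None  # assumed marker for a missing categorical value


def mode_impute(column):
    """Stage 1: fill missing entries of one column with its most frequent value."""
    mode = Counter(v for v in column if v is not MISSING).most_common(1)[0][0]
    return [mode if v is MISSING else v for v in column]


def one_hot(columns):
    """One-hot encode a list of fully observed categorical columns."""
    blocks = []
    for col in columns:
        cats = sorted(set(col))
        blocks.append(np.array([[float(v == c) for c in cats] for v in col]))
    return np.hstack(blocks)


def pnn_predict(X_train, y_train, X_test, sigma=0.5):
    """Stage 2: classify test rows with a Gaussian-kernel PNN (equal priors assumed)."""
    classes = sorted(set(y_train))
    y_train = np.array(y_train)
    # Pattern layer: Gaussian kernel between every test row and every training row.
    d2 = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    k = np.exp(-d2 / (2.0 * sigma ** 2))
    # Summation layer: average kernel activation per class.
    scores = np.column_stack([k[:, y_train == c].mean(axis=1) for c in classes])
    # Decision layer: choose the class with the largest score.
    return [classes[i] for i in scores.argmax(axis=1)]


def mode_pnn_impute(rows, sigma=0.5):
    """Impute each missing cell with a PNN whose inputs are the mode-filled other columns."""
    cols = [list(c) for c in zip(*rows)]
    filled = [mode_impute(c) for c in cols]
    result = [list(c) for c in cols]
    for j, col in enumerate(cols):
        miss = [i for i, v in enumerate(col) if v is MISSING]
        if not miss:
            continue
        obs = [i for i, v in enumerate(col) if v is not MISSING]
        X = one_hot([filled[k] for k in range(len(cols)) if k != j])
        preds = pnn_predict(X[obs], [col[i] for i in obs], X[miss], sigma)
        for i, p in zip(miss, preds):
            result[j][i] = p
    return [list(r) for r in zip(*result)]


# Toy data (hypothetical): the last record is missing its third attribute.
rows = [
    ["red", "round", "apple"],
    ["red", "round", "apple"],
    ["red", "round", "apple"],
    ["yellow", "long", "banana"],
    ["yellow", "long", "banana"],
    ["yellow", "long", MISSING],
]
imputed = mode_pnn_impute(rows)
print(imputed[-1])
```

In this toy example a plain mode fill would assign the majority value "apple" to the last record, whereas the PNN uses the record's observed attributes and assigns "banana". The sketch needs at least two columns and does no tuning of `sigma`.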

[1] Pilar Rey-del-Castillo, et al. Fuzzy min–max neural networks for categorical data: application to missing data imputation, 2012.

[2] Russ B. Altman, et al. Missing value estimation methods for DNA microarrays, 2001, Bioinform.

[3] Vadlamani Ravi, et al. A Computational Intelligence Based Online Data Imputation Method: An Application For Banking, 2013, J. Inf. Process. Syst.

[4] Piet M. T. Broersen, et al. Autoregressive spectral analysis when observations are missing, 2004, Autom.

[5] Anders Holmberg, et al. Prelude to the Special Issue on Systems and Architectures for High-Quality Statistics Production, 2013.

[6] Phil D. Green, et al. Handling missing data in speech recognition, 1994, ICSLP.

[7] Nikola K. Kasabov, et al. DENFIS: dynamic evolving neural-fuzzy inference system and its application for time-series prediction, 2002, IEEE Trans. Fuzzy Syst.

[8] Jeffrey D Dawson, et al. Complete imputation of missing repeated categorical data: one‐sample applications, 2002, Statistics in medicine.

[9] Amit Gupta, et al. Estimating Missing Values Using Neural Networks, 1996.

[10] Qiang Wang, et al. Missing categorical data imputation approach based on similarity, 2012, 2012 IEEE International Conference on Systems, Man, and Cybernetics (SMC).

[11] A. Agresti, et al. Categorical Data Analysis, 1991, International Encyclopedia of Statistical Science.

[12] Man-Lai Tang, et al. Grouped Dirichlet distribution: A new tool for incomplete categorical data analysis, 2008.

[13] Paul E. Green, et al. An alternating least‐squares procedure for estimating missing preference data in product‐concept testing, 1986.

[14] Qinbao Song, et al. A new imputation method for small software project data sets, 2007, J. Syst. Softw.

[15] Claes Wohlin, et al. An evaluation of k-nearest neighbour imputation using Likert data, 2004.

[16] Stef van Buuren, et al. Imputation of missing categorical data by maximizing internal consistency, 1992.

[17] Amaury Lendasse, et al. X-SOM and L-SOM: A double classification approach for missing value imputation, 2010, Neurocomputing.

[18] Aníbal R. Figueiras-Vidal, et al. Pattern classification with missing data: a review, 2010, Neural Computing and Applications.

[19] Joseph L Schafer, et al. Analysis of Incomplete Multivariate Data, 1997.

[20] Stuart G Baker, et al. A sensitivity analysis for nonrandomly missing categorical data arising from a national health disability survey, 2003, Biostatistics.

[21] Qiang Zhao, et al. Bayesian method for learning graphical models with incompletely categorical data, 2003, Comput. Stat. Data Anal.

[22] David H. Schoellhamer, et al. Singular spectrum analysis for time series with missing data, 2001.

[23] Mulugeta Gebregziabher, et al. Latent class based multiple imputation approach for missing categorical data, 2010, Journal of statistical planning and inference.

[24] P. Roth, et al. Missing Data in Multiple Item Scales: A Monte Carlo Analysis of Missing Data Techniques, 1999.

[25] Bogdan Gabrys, et al. Neuro-fuzzy approach to processing inputs with missing values in pattern recognition problems, 2002, Int. J. Approx. Reason.

[26] Claudomiro Sales, et al. Multi-objective genetic algorithm for missing data imputation, 2015, Pattern Recognit. Lett.

[27] D. Rubin, et al. Statistical Analysis with Missing Data, 1989.

[28] Donald F. Specht, et al. Probabilistic neural networks, 1990, Neural Networks.

[29] Lefteris Angelis, et al. Categorical missing data imputation for software cost estimation by multinomial logistic regression, 2006, J. Syst. Softw.

[30] Lukasz A. Kurgan, et al. Impact of imputation of missing values on classification error for discrete data, 2008, Pattern Recognit.

[31] Qinbao Song, et al. A Short Note on Safest Default Missingness Mechanism Assumptions, 2004, Empirical Software Engineering.

[32] Qinbao Song, et al. Dealing with missing software project data, 2003, Proceedings. 5th International Workshop on Enterprise Networking and Computing in Healthcare Industry (IEEE Cat. No.03EX717).

[33] Jiri Kaiser. Algorithm for Missing Values Imputation in Categorical Data with Use of Association Rules, 2012, ArXiv.

[34] Stephen Henley, et al. The problem of missing data in geoscience databases, 2006, Comput. Geosci.

[35] J. Vermunt, et al. Multiple Imputation of Incomplete Categorical Data Using Latent Class Analysis, 2008.

[36] Paola Annoni, et al. An imputation method for categorical variables with application to nonlinear principal component analysis, 2011, Comput. Stat. Data Anal.

[37] T. Åstebro, et al. How to Deal with Missing Categorical Data: Test of a Simple Bayesian Method, 2003.

[38] Tshilidzi Marwala, et al. The use of genetic algorithms and neural networks to approximate missing data in database, 2005, IEEE 3rd International Conference on Computational Cybernetics (ICCC 2005).

[39] Natalie Shlomo, et al. Calibrated Hot-Deck Donor Imputation Subject to Edit Restrictions, 2013.