Imputation of missing data with neural networks for classification

Abstract: We propose a mechanism for using data with missing values to design classifiers, a task that is distinct from predicting missing values for classification. Our imputation method uses an auto-encoder neural network trained with a two-stage scheme, which makes innovative use of the training data without missing values so that the auto-encoder is better equipped to predict missing values. Unlike most existing auto-encoder-based methods, which use a bottleneck layer for missing-data handling, we justify and use a latent space of much higher dimension than the input. To design a classifier from a training set with missing values, we use the trained auto-encoder to predict the missing values, based on the hypothesis that a good choice for a missing value is one that can reconstruct itself via the auto-encoder. We first make an initial guess of each missing value using the nearest-neighbor rule and then refine it by minimizing the reconstruction error. We train several classifiers on the union of the imputed instances and the remaining training instances without missing values. For each, we also train a classifier of the same type and configuration on the corresponding complete dataset, and we compare the performance of the two. We compare the proposed method with eight state-of-the-art imputation techniques using fourteen datasets and eight classification strategies.
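The imputation step described in the abstract (a nearest-neighbor initial guess, then refinement by minimizing the auto-encoder's reconstruction error) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and several details are assumptions: the auto-encoder is a single tanh hidden layer with random, untrained weights standing in for a trained network; the reconstruction error is taken over all features; and the refinement is plain gradient descent with step halving. The names `nn_initial_guess`, `reconstruct`, `recon_error`, and `refine` are hypothetical.

```python
import numpy as np

def nn_initial_guess(x, missing, complete):
    """Copy the missing entries of x from its nearest complete instance.
    Distance is computed on the observed features only."""
    obs = ~missing
    dists = np.sum((complete[:, obs] - x[obs]) ** 2, axis=1)
    filled = x.copy()
    filled[missing] = complete[np.argmin(dists), missing]
    return filled

def reconstruct(x, params):
    """One-hidden-layer auto-encoder (assumed architecture)."""
    W1, b1, W2, b2 = params
    h = np.tanh(W1 @ x + b1)
    return W2 @ h + b2, h

def recon_error(x, params):
    out, _ = reconstruct(x, params)
    return float(np.sum((out - x) ** 2))

def refine(x, missing, params, lr=0.1, steps=200):
    """Adjust only the missing entries of x to reduce the reconstruction
    error. A step is accepted only if the error decreases; otherwise the
    step size is halved (an assumed safeguard, not from the paper)."""
    W1, b1, W2, b2 = params
    x = x.copy()
    err = recon_error(x, params)
    for _ in range(steps):
        out, h = reconstruct(x, params)
        r = out - x
        # Jacobian of the auto-encoder output with respect to its input
        J = W2 @ (W1 * (1.0 - h ** 2)[:, None])
        grad = 2.0 * (J.T @ r - r)  # gradient of ||f(x) - x||^2 w.r.t. x
        cand = x.copy()
        cand[missing] -= lr * grad[missing]
        cand_err = recon_error(cand, params)
        if cand_err < err:
            x, err = cand, cand_err
        else:
            lr *= 0.5
    return x

# Toy setup: 3 input features, 12 hidden units (latent space wider than input).
rng = np.random.default_rng(0)
d, H = 3, 12
params = (rng.normal(scale=0.3, size=(H, d)), np.zeros(H),
          rng.normal(scale=0.3, size=(d, H)), np.zeros(d))  # untrained stand-in

complete = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])  # instances without missing values
x = np.array([0.9, np.nan, 1.1])                          # instance with one missing feature
missing = np.isnan(x)

x0 = nn_initial_guess(x, missing, complete)  # nearest-neighbor initial guess
x1 = refine(x0, missing, params)             # refined imputation
print("initial guess:", x0)
print("refined:      ", x1)
```

In this sketch the observed features are never modified; only the missing entries move, and the reconstruction error of the refined instance is never larger than that of the initial guess. With a properly trained auto-encoder in place of the random weights, the same loop would pull the imputed values toward ones the network can reproduce.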