Generative Adversarial Networks Imputation for High Rate Missing Values

Missing values (MVs) are common in real-world datasets and hinder data analytics, because many statistical and machine learning algorithms cannot process incomplete data. Most existing MV imputation methods apply only to specific types of datasets or to low missing rates. To address this problem, we propose a new imputation method for missing completely at random (MCAR) data with a high missing rate. The method is based on the generative adversarial network (GAN) architecture. We train the model on a discrete dataset with missing values so that the feature distribution of the generated dataset closely matches that of the original dataset. We evaluate the method on two different types of data to demonstrate its feasibility and efficiency. The first is an authoritative public dataset of wireless sensor records. The second is a large collection of data gathered from an industrial production monitoring process. Compared with traditional missing value imputation methods, our method is more robust and more accurate when the missing rate is higher than 30%.
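To make the evaluation setting concrete, the following minimal sketch shows how MCAR missingness at a chosen rate can be simulated and how a traditional baseline can be scored on the hidden cells. It is an illustration under stated assumptions, not the method proposed here: the helper names (`mcar_mask`, `mean_impute`, `rmse_on_missing`), the column-mean baseline, and the RMSE metric are all hypothetical choices that the abstract does not specify.

```python
import random


def mcar_mask(n_rows, n_cols, missing_rate, seed=0):
    """Return a mask where True marks an observed cell.

    Each cell is hidden independently with probability `missing_rate`,
    regardless of its value, which is the MCAR assumption.
    """
    rng = random.Random(seed)
    return [[rng.random() >= missing_rate for _ in range(n_cols)]
            for _ in range(n_rows)]


def mean_impute(data, mask):
    """Assumed baseline: fill each hidden cell with its column's observed mean."""
    n_rows, n_cols = len(data), len(data[0])
    filled = [row[:] for row in data]
    for j in range(n_cols):
        observed = [data[i][j] for i in range(n_rows) if mask[i][j]]
        # Fall back to 0.0 if a column has no observed value at all.
        col_mean = sum(observed) / len(observed) if observed else 0.0
        for i in range(n_rows):
            if not mask[i][j]:
                filled[i][j] = col_mean
    return filled


def rmse_on_missing(truth, filled, mask):
    """Root mean squared error computed only over the hidden cells."""
    errors = [(truth[i][j] - filled[i][j]) ** 2
              for i in range(len(truth))
              for j in range(len(truth[0]))
              if not mask[i][j]]
    return (sum(errors) / len(errors)) ** 0.5 if errors else 0.0


if __name__ == "__main__":
    rng = random.Random(42)
    # Synthetic stand-in data; the datasets used in the paper are not reproduced here.
    truth = [[rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(200)]
    for rate in (0.1, 0.3, 0.5, 0.7):
        mask = mcar_mask(200, 5, rate, seed=1)
        filled = mean_impute(truth, mask)
        print(f"missing rate {rate:.0%}: RMSE = {rmse_on_missing(truth, filled, mask):.3f}")
```

Any imputation method, including a GAN-based one, could be substituted for `mean_impute` and scored with the same mask and metric, which keeps the comparison across missing rates consistent.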