A Hybrid GP-KNN Imputation for Symbolic Regression with Missing Values

In data science, missing values are a serious challenge when working with real-world data sets. Although many imputation approaches have been proposed to handle missing values in machine learning, most studies focus on classification rather than regression. To the best of our knowledge, no study has investigated the use of imputation methods when performing symbolic regression on incomplete real-world data sets. In this work, we propose a new imputation method called GP-KNN, a hybrid that combines two concepts: Genetic Programming Imputation (GPI) and K-Nearest Neighbour (KNN). GP-KNN considers both feature relevance and instance relevance. The experimental results show that the proposed method outperforms state-of-the-art imputation methods in most of the considered cases, in terms of both imputation accuracy and symbolic regression performance.
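To make the idea of combining instance relevance with feature relevance concrete, the following is a minimal sketch and not the GP-KNN algorithm itself, which the abstract does not specify in detail. It assumes that the K nearest complete instances are selected first and that a regression model fitted on those neighbours then predicts each missing feature from the observed ones. An ordinary least-squares model stands in for the evolved GP expression, and the function name `knn_regression_impute` and the parameter `k` are hypothetical.

```python
import numpy as np


def knn_regression_impute(X, k=3):
    """Fill NaN entries of X (illustrative sketch, not the paper's method).

    For each incomplete row:
      1. instance relevance: pick the k nearest complete rows, measuring
         distance only over the features observed in that row;
      2. feature relevance: fit a model on those neighbours that predicts
         the missing features from the observed ones.
    A linear least-squares model is used here as a simple stand-in for a
    GP-evolved expression (an assumption made for brevity).
    """
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    complete = X[~np.isnan(X).any(axis=1)]
    if len(complete) == 0:
        raise ValueError("at least one complete row is required")
    k = min(k, len(complete))

    for i in np.where(np.isnan(X).any(axis=1))[0]:
        miss = np.isnan(X[i])
        obs = ~miss
        # Step 1: nearest complete instances on the observed features.
        dists = np.linalg.norm(complete[:, obs] - X[i, obs], axis=1)
        neighbours = complete[np.argsort(dists)[:k]]
        # Step 2: least-squares fit (with intercept) on the neighbours.
        A = np.column_stack([neighbours[:, obs], np.ones(k)])
        coef, *_ = np.linalg.lstsq(A, neighbours[:, miss], rcond=None)
        query = np.append(X[i, obs], 1.0)
        filled[i, miss] = query @ coef
    return filled


if __name__ == "__main__":
    data = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0],
                     [4.0, 8.0], [2.5, np.nan]])
    print(knn_regression_impute(data, k=3))
```

Restricting the fit to the nearest neighbours keeps the model local to instances similar to the incomplete one, while the regression step exploits relationships between features. In a GP-based variant, the least-squares fit would be replaced by an evolved symbolic expression.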