Genetic Programming with Noise Sensitivity for Imputation Predictor Selection in Symbolic Regression with Incomplete Data

This paper presents a feature selection method that incorporates a sensitivity-based single-feature importance measure into a context-based feature selection approach. The single-feature importance is measured by how sensitive the learning performance is to noise added to each predictive feature. Genetic programming serves as the context-based selection mechanism: a feature is selected according to how much the performance of the evolved genetic programming models changes when that feature is injected with noise. Imputation is a key strategy for mitigating data incompleteness, yet it has rarely been investigated for symbolic regression on incomplete data, and this work aims to help fill that gap. The proposed method is applied to selecting imputation predictors (features/variables) in symbolic regression with missing values. The evaluation is performed on real-world data sets using three performance measures: imputation accuracy, symbolic regression performance, and feature reduction ability. The experimental results show that, compared with the benchmark methods, the proposed method achieves more accurate imputation, improves symbolic regression performance, and uses smaller sets of selected predictors.
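
To make the noise-sensitivity idea concrete, the following is a minimal sketch (not the authors' GP implementation): a feature's importance is taken as the increase in a trained model's error when Gaussian noise is injected into that feature alone. A scikit-learn linear regressor stands in for the evolved GP models, and the function name noise_sensitivity_importance and its parameters (noise_scale, n_repeats) are illustrative assumptions, not from the paper.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def noise_sensitivity_importance(model, X, y, noise_scale=0.1, n_repeats=10, seed=None):
    """Score each feature by how much the model's test error grows when
    Gaussian noise is injected into that feature (one feature at a time)."""
    rng = np.random.default_rng(seed)
    baseline = mean_squared_error(y, model.predict(X))
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        # Noise magnitude is scaled to the feature's own spread.
        scale = noise_scale * X[:, j].std()
        deltas = []
        for _ in range(n_repeats):
            X_noisy = X.copy()
            X_noisy[:, j] += rng.normal(0.0, scale, size=X.shape[0])
            deltas.append(mean_squared_error(y, model.predict(X_noisy)) - baseline)
        importances[j] = np.mean(deltas)
    return importances


if __name__ == "__main__":
    # Synthetic example: only features 0 and 2 influence the target.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=200)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    model = LinearRegression().fit(X_tr, y_tr)
    scores = noise_sensitivity_importance(model, X_te, y_te, seed=1)
    print(np.argsort(scores)[::-1])  # informative features should rank first
```

In the paper's setting, the model being perturbed would be an evolved GP individual and the selected features would serve as imputation predictors; this sketch only illustrates the noise-injection sensitivity measure itself.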
