Sensitivity-like analysis for feature selection in genetic programming

Feature selection is an important step in many machine learning problems. Through the selective pressures imposed on models during evolution, genetic programming performs a basic form of feature selection, so analysis of the evolved models can provide some insight into the utility of input features. Previous work has tended towards a presence-based model of feature selection, in which the frequency with which a feature appears in evolved models serves as a proxy for its utility. In this paper, we identify drawbacks of this approach and instead propose importance measures for feature selection that quantify the influence a feature has on a model's behaviour. Using sensitivity-like analysis methods inspired by the importance measures used in random forest regression, we demonstrate that genetic programming introduces many features into evolved models that have little impact on a given model's behaviour, and that this can mask the true importance of salient features. The paper concludes by exploring bloat control and adaptive terminal selection as ways of steering genetic programming towards useful features, with results suggesting that a combination of adaptive terminal selection and bloat control may help to improve generalisation performance.
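
A minimal sketch of one such sensitivity-like measure is given below, using the permutation-importance idea popularised by random forests: a feature's importance is the increase in model error when its values are randomly shuffled. The toy dataset, the `model` function (standing in for an evolved GP expression), and the `permutation_importance` helper are hypothetical illustrations under these assumptions, not the paper's actual implementation.

```python
# Sketch: permutation-based, sensitivity-like feature importance for a
# symbolic model. The model and data are hypothetical stand-ins for an
# evolved GP tree and a regression dataset.
import numpy as np

rng = np.random.default_rng(0)

# Toy data: x2 is salient, x0 is weakly used, x1 is irrelevant noise.
X = rng.uniform(-1.0, 1.0, size=(500, 3))
y = 3.0 * X[:, 2] ** 2 + 0.01 * X[:, 0] + rng.normal(0.0, 0.05, size=500)

def model(X):
    """Stand-in for an evolved expression: x0 and x2 are both *present*,
    but x0 has almost no influence on the output."""
    return 3.0 * X[:, 2] ** 2 + 0.01 * X[:, 0]

def mse(model, X, y):
    return np.mean((model(X) - y) ** 2)

def permutation_importance(model, X, y, n_repeats=30):
    """Importance of feature j = mean increase in error when column j is
    randomly permuted, breaking its relationship with the target."""
    base = mse(model, X, y)
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        deltas = []
        for _ in range(n_repeats):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])
            deltas.append(mse(model, Xp, y) - base)
        importances[j] = np.mean(deltas)
    return importances

print(permutation_importance(model, X, y))
```

A presence-based count would credit `x0` and `x2` equally, since each appears once in the expression, whereas the permuted-error measure reveals that `x0`'s influence on the model's behaviour is negligible, which is the distinction the abstract draws between feature presence and feature importance.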
