A general feature engineering wrapper for machine learning using ε-lexicase survival

We propose a general wrapper for feature learning that interfaces with other machine learning methods to compose effective data representations. The proposed feature engineering wrapper (FEW) uses genetic programming to represent and evolve individual features tailored to the machine learning method with which it is paired. To maintain feature diversity, we introduce ε-lexicase survival, a method based on ε-lexicase selection. This survival method preserves semantically unique individuals in the population based on their ability to solve difficult subsets of the training cases, thereby yielding a population of uncorrelated features. We demonstrate FEW with five different off-the-shelf machine learning methods and test it on a set of real-world and synthetic regression problems whose dimensions vary across three orders of magnitude. The results show that FEW improves model test predictions across problems for several of the ML methods. We discuss and test the scalability of FEW in comparison to other feature composition strategies, most notably polynomial feature expansion.
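To make the survival mechanism concrete, the sketch below shows ε-lexicase filtering used as a survival step over a matrix of per-case errors. It is a minimal illustration, not FEW's implementation: the per-case ε (the median absolute deviation of errors on that case) follows the ε-lexicase selection literature, and the choices to draw survivors with replacement and to break residual ties uniformly at random are assumptions.

```python
import numpy as np

def epsilon_lexicase_survival(errors, n_survivors, rng=None):
    """Select survivors by epsilon-lexicase filtering.

    errors: (n_individuals, n_cases) array of absolute errors of each
            candidate feature/program on each training case.
    Returns an array of indices of the chosen survivors.

    Sketch only: epsilon definition and tie-breaking are assumptions,
    not necessarily identical to FEW's implementation.
    """
    rng = np.random.default_rng(rng)
    n_individuals, n_cases = errors.shape

    # Per-case epsilon: median absolute deviation of the population's
    # errors on that case, as in epsilon-lexicase selection for regression.
    medians = np.median(errors, axis=0)
    eps = np.median(np.abs(errors - medians), axis=0)

    survivors = []
    for _ in range(n_survivors):
        candidates = np.arange(n_individuals)
        # Filter candidates case by case, in a fresh random order
        # for each survivor drawn.
        for case in rng.permutation(n_cases):
            best = errors[candidates, case].min()
            passing = candidates[errors[candidates, case] <= best + eps[case]]
            candidates = passing
            if len(candidates) == 1:
                break
        # If several candidates survive every case, break the tie randomly.
        survivors.append(rng.choice(candidates))
    return np.array(survivors)
```

Because each survivor is chosen against a different random ordering of cases, individuals that excel on distinct, difficult subsets of the training data all tend to persist, which is the source of the feature-diversity claim above.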
