Automatic feature engineering for regression models with machine learning: An evolutionary computation and statistics hybrid

Abstract: Symbolic Regression (SR) is a well-studied task in Evolutionary Computation (EC), in which adequate free-form mathematical models must be discovered automatically from observed data. Statisticians, engineers, and data scientists in general still prefer traditional regression methods over EC methods because of their solid mathematical foundations, the interpretability of their models, and their lack of randomness, even though such deterministic methods tend to provide lower-quality predictions than stochastic EC methods. EC solutions, on the other hand, can be large and uninterpretable, but they can be created with less bias, finding high-quality solutions that human researchers would tend to avoid. Another interesting possibility is to use EC methods to perform automatic feature engineering for a deterministic regression method instead of evolving a single model; this may lead to smaller solutions that are easier to understand. In this contribution, we evaluate an approach called Kaizen Programming (KP) to develop a hybrid method that combines EC and Statistics. The EC method builds the features, while the statistical method efficiently builds the models and also provides the importance of each feature; the features are thus improved over the iterations, resulting in better models. We examine a large set of benchmark SR problems known from the EC literature. Our experiments show that KP outperforms traditional Genetic Programming, a popular EC method for SR, and also improves on other methods, including other hybrids and well-known statistical and Machine Learning (ML) methods. More in line with ML than with EC approaches, KP provides high-quality solutions while requiring only a small number of function evaluations.
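The feature-building and model-building loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the primitive set, the random feature generator, the importance measure (absolute coefficient times the feature's standard deviation), the acceptance rule, and all names such as `kaizen_sketch` are assumptions chosen for illustration. A random generator stands in for the EC component, and ordinary least squares stands in for the statistical method.

```python
import numpy as np

# Hypothetical primitive set; the abstract does not list the primitives used.
UNARY = {"sin": np.sin, "sq": np.square, "id": lambda v: v}
BINARY = {"mul": np.multiply, "add": np.add}


def random_feature(rng, n_vars):
    """Build one random candidate feature of the form op(u(x_i), x_j)."""
    i, j = (int(v) for v in rng.integers(n_vars, size=2))
    u_names, b_names = list(UNARY), list(BINARY)
    u_name = u_names[int(rng.integers(len(u_names)))]
    b_name = b_names[int(rng.integers(len(b_names)))]
    u, op = UNARY[u_name], BINARY[b_name]
    label = f"{b_name}({u_name}(x{i}), x{j})"
    return label, (lambda X: op(u(X[:, i]), X[:, j]))


def evaluate(features, X):
    """Evaluate every feature on X and stack the results as columns."""
    return np.column_stack([fn(X) for _, fn in features])


def fit_ols(F, y):
    """Fit y ~ intercept + F by least squares; return coefficients and MSE."""
    A = np.column_stack([np.ones(len(y)), F])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    mse = float(np.mean((A @ coef - y) ** 2))
    return coef, mse


def kaizen_sketch(X, y, k=4, iters=50, seed=0):
    """Iteratively improve a set of k features scored by a linear model."""
    rng = np.random.default_rng(seed)
    feats = [random_feature(rng, X.shape[1]) for _ in range(k)]
    _, best = fit_ols(evaluate(feats, X), y)
    history = [best]
    for _ in range(iters):
        # Propose new candidate features and pool them with the current ones.
        pool = feats + [random_feature(rng, X.shape[1]) for _ in range(k)]
        F = evaluate(pool, X)
        coef, _ = fit_ols(F, y)
        # Assumed importance: |coefficient| scaled by the feature's spread.
        importance = np.abs(coef[1:]) * F.std(axis=0)
        trial = [pool[int(idx)] for idx in np.argsort(importance)[-k:]]
        _, mse = fit_ols(evaluate(trial, X), y)
        # Accept the new feature set only if the refitted model improves.
        if mse < best:
            feats, best = trial, mse
        history.append(best)
    return feats, history


# Synthetic data for demonstration only; not one of the paper's benchmarks.
demo_rng = np.random.default_rng(1)
X_demo = demo_rng.uniform(-1.0, 1.0, size=(200, 2))
y_demo = np.sin(X_demo[:, 0]) * X_demo[:, 1] + X_demo[:, 0] ** 2
demo_feats, demo_history = kaizen_sketch(X_demo, y_demo)
print([label for label, _ in demo_feats], demo_history[-1])
```

Because a candidate set replaces the current one only when the refitted model has a lower error, the training error in this sketch never increases from one iteration to the next. The final model stays a linear combination of a few readable features, which is the interpretability argument the abstract makes.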
