Benchmarking state-of-the-art symbolic regression algorithms

Symbolic regression (SR) is a powerful method for building predictive models from data without assuming any model structure. Traditionally, genetic programming (GP) has been used as the SR engine. However, purely evolutionary methods struggle even to fit the model to the range of the data, so training is inefficient and slow. Recently, several SR algorithms have emerged that employ multiple linear regression. This allows them to create models with relatively small error right from the beginning of the search. Such algorithms are claimed to be orders of magnitude faster than SR algorithms based on classic GP. However, a systematic comparison of these algorithms on a common set of problems is still missing, so there is no basis for deciding which algorithm to use. In this paper we conceptually and experimentally compare several representatives of such algorithms: GPTIPS, FFX, and EFS. We also include GSGP-Red, an enhanced version of geometric semantic genetic programming, which is an important algorithm in the field of SR. The algorithms are applied as off-the-shelf, ready-to-use techniques, mostly with their default settings. They are compared on several synthetic SR benchmark problems as well as on real-world problems ranging from civil engineering to aerodynamics and acoustics. Their performance is also related to that of three conventional machine learning algorithms: multiple regression, random forests, and support vector regression. The results suggest that, across all the problems, the algorithms have comparable performance. We provide basic recommendations to the user regarding the choice of algorithm.
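The role of multiple linear regression described above can be illustrated with a minimal sketch. It is not the implementation of GPTIPS, FFX, EFS, or GSGP-Red: the target function, the hand-picked candidate features, and the helper name `fit_linear_combination` are assumptions made purely for illustration. The sketch shows that once the coefficients of a linear combination of nonlinear features are fitted by least squares, the model is scaled and shifted to the range of the data, so its error is small even before any search over the features themselves.

```python
import numpy as np


def fit_linear_combination(features, y):
    """Least-squares fit of y ~ b0 + sum_i b_i * features[:, i].

    Hypothetical helper: returns the coefficients (intercept first)
    and the fitted predictions.
    """
    design = np.column_stack([np.ones(len(y)), features])
    coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coeffs, design @ coeffs


rng = np.random.default_rng(0)
x = rng.uniform(-3.0, 3.0, size=200)

# Assumed target with a large offset and scale, so that an unscaled
# expression is far from the range of the data.
y = 100.0 + 50.0 * x * np.sin(x)

# A fixed set of candidate nonlinear features; in a real SR algorithm
# these would be generated or evolved rather than hand-picked.
features = np.column_stack([x, x**2, np.sin(x), np.cos(x)])

# Baseline: a bare expression with no fitted coefficients.
raw_guess = np.sin(x)
rmse_raw = np.sqrt(np.mean((y - raw_guess) ** 2))

# Same kind of building blocks, combined by multiple linear regression.
coeffs, pred = fit_linear_combination(features, y)
rmse_fit = np.sqrt(np.mean((y - pred) ** 2))

print(f"RMSE without fitted coefficients: {rmse_raw:.2f}")
print(f"RMSE with least-squares coefficients: {rmse_fit:.2f}")
```

Because the regression includes an intercept, the fitted model can never be worse on the training data than predicting the mean of the target, whereas the unscaled expression misses the offset entirely. A purely evolutionary search would have to discover suitable constants by itself.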
