Contemporary Symbolic Regression Methods and their Relative Performance

Many promising approaches to symbolic regression have been presented in recent years, yet progress in the field continues to suffer from a lack of uniform, robust, and transparent benchmarking standards. We address this shortcoming by introducing an open-source, reproducible benchmarking platform for symbolic regression. We assess 14 symbolic regression methods and 7 machine learning methods on a set of 252 diverse regression problems. Our assessment includes both real-world datasets with no known model form and synthetic benchmark problems with known ground-truth models. For the real-world datasets, we benchmark the ability of each method to learn models with low error and low complexity relative to state-of-the-art machine learning methods. For the synthetic problems, we assess each method’s ability to find exact solutions in the presence of varying levels of noise. Under these controlled experiments, we conclude that the best performing methods for real-world regression combine genetic algorithms with parameter estimation and/or semantic search drivers. When tasked with recovering exact equations in the presence of noise, we find that several approaches perform similarly. We provide a detailed guide to reproducing this experiment and contributing new methods, and encourage other researchers to collaborate with us on a common and living symbolic regression benchmark.

∗corresponding author. Formerly Institute for Biomedical Informatics, University of Pennsylvania
†Department of Automatics and Robotics, AGH University of Science and Technology, Krakow, Poland
‡Center for Mathematics, Computation and Cognition | Heuristics, Analysis and Learning Laboratory
§Formerly (during preparation of this paper) at Chalmers University of Technology, Sweden

35th Conference on Neural Information Processing Systems (NeurIPS 2021) Track on Datasets and Benchmarks.
