Efficient approaches to interleaved sampling of training data for symbolic regression

The ability to generalise beyond the training set is paramount for any machine learning algorithm, and Genetic Programming (GP) is no exception. This paper investigates a recently proposed technique for improving generalisation in GP, termed Interleaved Sampling, in which GP alternates between using the entire training set in one generation and only a single data point in the next. This paper proposes two alternatives to the single data point: replacing it with random search, and simply minimising tree size. Both approaches are more efficient than the original Interleaved Sampling because they do not evaluate fitness on the training data in half of the generations. The results show that, in terms of generalisation, random search and size minimisation are as effective as the original Interleaved Sampling, while being computationally cheaper in terms of data processing. Size minimisation is particularly interesting because it completely prevents bloat while remaining competitive in both training performance and generalisation. The trees it produces are also substantially smaller, further reducing computational expense.
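As a rough illustration of how the three schemes differ, the sketch below expresses each as a per-generation fitness switch in Python. It is a minimal sketch under assumed interfaces, not the paper's implementation: the individual is assumed to expose an error(X, y) method and a size attribute, and all names are hypothetical.

```python
import random

# Minimal sketch of the interleaved fitness switch described above.
# Everything here is illustrative; the paper does not prescribe this API.
# An "individual" is assumed to expose error(X, y) (training error) and
# size (number of tree nodes).

def interleaved_fitness(individual, X, y, generation, variant="single_point"):
    """Fitness to minimise in one generation of the interleaved scheme.

    Even generations: ordinary fitness on the full training set.
    Odd generations, depending on the variant:
      "single_point" - error on one random training case (the original
                       Interleaved Sampling); one evaluation per individual.
      "random"       - a random score, so selection degenerates to random
                       search and no training data is processed.
      "size"         - the tree size, so selection minimises size and,
                       again, no training data is processed.
    """
    if generation % 2 == 0:
        return individual.error(X, y)            # full training set
    if variant == "single_point":
        i = random.randrange(len(y))
        return individual.error([X[i]], [y[i]])  # a single data point
    if variant == "random":
        return random.random()                   # no fitness evaluation
    if variant == "size":
        return individual.size                   # parsimony only
    raise ValueError(f"unknown variant: {variant}")
```

The "random" and "size" variants make the efficiency claim concrete: in odd generations neither touches the training data at all, whereas the original scheme still performs one (single-point) evaluation per individual.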
