A Re-Examination of the Use of Genetic Programming on the Oral Bioavailability Problem

Difficult benchmark problems are in increasing demand in Genetic Programming (GP). One problem seeing increased use is the oral bioavailability problem, which is often presented as challenging for both GP and other machine learning methods. However, few properties of the bioavailability data set have been demonstrated, so the attributes that make it challenging are largely unknown. This work uncovers important properties of the bioavailability data set and suggests that the perceived difficulty of the problem can be partially attributed to a lack of pre-processing: the data set includes features that contain no information, as well as contradictory relationships between its dependent and independent features. The paper then re-examines the performance of GP on this data set and contextualises that performance relative to other regression methods. Results suggest that a large component of the observed performance differences on the bioavailability data set can be attributed to variance in the selection of training and testing data. Differences in performance between GP and other methods disappear when multiple training/testing splits are used in experimental work, and performance is typically no better than that of a null model that simply reports the mean of the training data.
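The null model and the repeated training/testing splits mentioned above can be illustrated with a minimal sketch. It is not the paper's experimental code: the function names, the 70/30 split, the 30 repetitions, and the synthetic target values are all assumptions chosen for illustration, and the bioavailability data itself is not used.

```python
import math
import random
import statistics


def rmse(predictions, targets):
    """Root mean squared error between two equal-length sequences."""
    squared = sum((p - t) ** 2 for p, t in zip(predictions, targets))
    return math.sqrt(squared / len(targets))


def mean_baseline_rmse(y_train, y_test):
    """Test RMSE of the null model: always predict the training-set mean."""
    train_mean = statistics.fmean(y_train)
    return rmse([train_mean] * len(y_test), y_test)


def repeated_split_baseline(y, n_splits=30, train_fraction=0.7, seed=0):
    """Score the null model on several random train/test splits.

    Returns one test RMSE per split, so the spread caused purely by
    the choice of split is visible. The defaults are illustrative
    assumptions, not values taken from the paper.
    """
    rng = random.Random(seed)
    n_train = int(len(y) * train_fraction)
    scores = []
    for _ in range(n_splits):
        indices = list(range(len(y)))
        rng.shuffle(indices)
        y_train = [y[i] for i in indices[:n_train]]
        y_test = [y[i] for i in indices[n_train:]]
        scores.append(mean_baseline_rmse(y_train, y_test))
    return scores


if __name__ == "__main__":
    # Synthetic placeholder targets; NOT the bioavailability data.
    data_rng = random.Random(42)
    y = [data_rng.uniform(0.0, 100.0) for _ in range(200)]
    scores = repeated_split_baseline(y)
    print(
        f"null-model RMSE over {len(scores)} splits: "
        f"mean={statistics.fmean(scores):.2f}, "
        f"sd={statistics.stdev(scores):.2f}, "
        f"min={min(scores):.2f}, max={max(scores):.2f}"
    )
```

Under this kind of protocol, any regression method would be scored on the same splits, and its distribution of test errors compared against the null model's distribution rather than against a single split.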
