The Effects of Randomly Sampled Training Data on Program Evolution

The effects of randomly sampled training data on genetic programming performance are empirically investigated. Often the most natural, if not only, means of characterizing the target behaviour for a problem is to randomly sample training cases inherent to that problem. A natural question to raise about this strategy is: how deleterious is the random sampling of training data to evolutionary performance? Will sampling reduce the evolutionary search to hill climbing? Can resampling during the run be advantageous? We address these questions by undertaking a suite of different GP experiments. Parameters include various sampling strategies (single sample, re-sampling, ideal samples), generational and steady-state evolution, and non-evolutionary strategies such as hill climbing and random search. The experiments confirm that random sampling effectively characterizes stochastic domains during genetic programming, provided that a sufficiently representative sample is used. An unexpected result is that genetic programming may perform worse than random search when the sampled training sets are exceptionally poor. We conjecture that poor training sets cause evolution to prematurely converge to undesirable optima, which irrevocably handicaps the population's diversity and viability.
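
To make the contrast between the sampling strategies concrete, the sketch below illustrates single-sample versus re-sampled fitness evaluation on a hypothetical stochastic domain (a noisy quadratic stands in for the paper's actual problems; the function names `target`, `fitness_single`, and `fitness_resampled` are illustrative, not the paper's implementation):

```python
import random

def target(x):
    """Hypothetical stochastic target: quadratic plus Gaussian noise,
    standing in for a problem domain that can only be sampled."""
    return x * x + random.gauss(0, 0.5)

def sample_training_cases(n):
    """Randomly sample n (input, output) training cases from the domain."""
    xs = [random.uniform(-2, 2) for _ in range(n)]
    return [(x, target(x)) for x in xs]

def fitness(program, cases):
    """Mean squared error of a candidate program over the training cases
    (lower is better)."""
    return sum((program(x) - y) ** 2 for x, y in cases) / len(cases)

# Single-sample strategy: one fixed training set drawn before the run
# and reused for every fitness evaluation.
fixed_cases = sample_training_cases(50)

def fitness_single(program):
    return fitness(program, fixed_cases)

# Re-sampling strategy: a fresh training set is drawn for each
# evaluation (in practice, often once per generation).
def fitness_resampled(program):
    return fitness(program, sample_training_cases(50))
```

Under the single-sample strategy an unrepresentative draw of `fixed_cases` biases every fitness comparison for the whole run, which is the scenario the conjecture above attributes to premature convergence; re-sampling trades that bias for a noisier, non-stationary fitness signal.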