Learning to Control Forest Fires with ESP

Reinforcement Learning (Kaelbling et al., 1996) can be used to learn to control an agent by letting it interact with its environment. In general there are two kinds of reinforcement learning: (1) value-function based reinforcement learning, which relies on heuristic dynamic programming algorithms such as temporal difference learning (Sutton, 1988) and Q-learning (Watkins, 1989), and (2) evolutionary algorithms such as genetic programming (Koza, 1992), Symbiotic Adaptive Neuro-Evolution (SANE) (Moriarty & Miikkulainen, 1996), and Enforced SubPopulations (ESP) (Gomez & Miikkulainen, 1998). There is still an ongoing debate about which of these approaches works best for a particular problem. For learning to play games, for example, value-function based RL often seems appropriate because the Markov assumption holds: Tesauro (1992) used temporal difference learning to let a program learn to play backgammon by playing against itself, which reached human-expert level of play. For non-Markovian environments, however, evolutionary approaches may sometimes be more beneficial.
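
To make the value-function based approach concrete, the sketch below illustrates the tabular Q-learning update rule (Watkins, 1989) on a hypothetical discrete environment. The environment interface (`reset`, `step`, `actions`) and the hyperparameter values are illustrative assumptions, not part of any of the cited systems.

```python
import random
from collections import defaultdict

def q_learning_episode(env, Q, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Run one episode of tabular Q-learning on a hypothetical environment.

    `env` is assumed to expose reset() -> state, step(action) -> (state, reward, done),
    and a list of discrete actions `env.actions`; these names are illustrative only.
    """
    state = env.reset()
    done = False
    while not done:
        # Epsilon-greedy action selection.
        if random.random() < epsilon:
            action = random.choice(env.actions)
        else:
            action = max(env.actions, key=lambda a: Q[(state, a)])
        next_state, reward, done = env.step(action)
        # Q-learning update: move Q(s, a) toward the bootstrapped target
        # r + gamma * max_a' Q(s', a').
        best_next = max(Q[(next_state, a)] for a in env.actions)
        target = reward + (0.0 if done else gamma * best_next)
        Q[(state, action)] += alpha * (target - Q[(state, action)])
        state = next_state
    return Q

# Usage: Q = defaultdict(float); call q_learning_episode(my_env, Q) repeatedly.
```

Evolutionary methods such as ESP, by contrast, do not learn a value function at all; they search directly in the space of (neural network) policies, which is one reason they can cope with non-Markovian state information.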