Empirical Studies in Action Selection with Reinforcement Learning

To excel in challenging tasks, intelligent agents need sophisticated mechanisms for action selection: they need policies that dictate what action to take in each situation. Reinforcement learning (RL) algorithms are designed to learn such policies given only positive and negative rewards. Two contrasting approaches to RL that are currently in popular use are temporal difference (TD) methods, which learn value functions, and evolutionary methods, which optimize populations of candidate policies. Both approaches have had practical successes but few studies have directly compared them. Hence, there are no general guidelines describing their relative strengths and weaknesses. In addition, there has been little cross-collaboration, with few attempts to make them work together or to apply ideas from one to the other. In this article we aim to address these shortcomings via three empirical studies that compare these methods and investigate new ways of making them work together. First, we compare the two approaches in a benchmark task and identify variations of the task that isolate factors critical to the performance of each method. Second, we investigate ways to make evolutionary algorithms excel at on-line tasks by borrowing exploratory mechanisms traditionally used by TD methods. We present empirical results demonstrating a dramatic performance improvement. Third, we explore a novel way of making evolutionary and TD methods work together by using evolution to automatically discover good representations for TD function approximators. We present results demonstrating that this novel approach can outperform both TD and evolutionary methods alone.

[1]  HighWire Press Philosophical Transactions of the Royal Society of London , 1781, The London Medical Journal.

[2]  J. Baldwin A New Factor in Evolution , 1896, The American Naturalist.

[3]  R. Bellman A PROBLEM IN THE SEQUENTIAL DESIGN OF EXPERIMENTS , 1954 .

[4]  John Holland,et al.  Adaptation in Natural and Artificial Sys-tems: An Introductory Analysis with Applications to Biology , 1975 .

[5]  J. Mason,et al.  Algorithms for approximation , 1987 .

[6]  M. J. D. Powell,et al.  Radial basis functions for multivariable interpolation: a review , 1987 .

[7]  Geoffrey E. Hinton,et al.  How Learning Can Guide Evolution , 1996, Complex Syst..

[8]  C. Watkins Learning from delayed rewards , 1989 .

[9]  David H. Ackley,et al.  Interactions between learning and evolution , 1991 .

[10]  John H. Holland,et al.  Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence , 1992 .

[11]  Leslie Pack Kaelbling,et al.  Learning in embedded systems , 1993 .

[12]  Andrew W. Moore,et al.  Generalization in Reinforcement Learning: Safely Approximating the Value Function , 1994, NIPS.

[13]  Mahesan Niranjan,et al.  On-line Q-learning using connectionist systems , 1994 .

[14]  Jeffrey L. Elman,et al.  Learning and Evolution in Neural Networks , 1994, Adapt. Behav..

[15]  Ida G. Sprinkhuizen-Kuyper,et al.  Evolving Artificial Neural Networks using the "Baldwin Effect" † , 1995 .

[16]  Richard S. Sutton,et al.  Generalization in Reinforcement Learning: Successful Examples Using Sparse Coarse Coding , 1995, NIPS.

[17]  Jordan B. Pollack,et al.  Coevolution of a Backgammon Player , 1996 .

[18]  Risto Miikkulainen,et al.  Efficient Reinforcement Learning through Symbiotic Evolution , 1996, Machine Learning.

[19]  Charles W. Anderson,et al.  Comparison of CMACs and radial basis functions for local function approximators in reinforcement learning , 1997, Proceedings of International Conference on Neural Networks (ICNN'97).

[20]  Risto Miikkulainen,et al.  Culling and Teaching in Neuro-Evolution , 1997, ICGA.

[21]  Alan F. Murray,et al.  IEEE International Conference on Neural Networks , 1997 .

[22]  Andrew W. Moore,et al.  Gradient Descent for General Reinforcement Learning , 1998, NIPS.

[23]  Richard S. Sutton,et al.  Introduction to Reinforcement Learning , 1998 .

[24]  David H. Wolpert,et al.  Bandit problems and the exploration/exploitation tradeoff , 1998, IEEE Trans. Evol. Comput..

[25]  Peter Stagge,et al.  Averaging Efficiently in the Presence of Noise , 1998, PPSN.

[26]  Larry D. Pyeatt,et al.  Decision Tree Function Approximation in Reinforcement Learning , 1999 .

[27]  Xin Yao,et al.  Evolving artificial neural networks , 1999, Proc. IEEE.

[28]  John N. Tsitsiklis,et al.  Actor-Critic Algorithms , 1999, NIPS.

[29]  John J. Grefenstette,et al.  Evolutionary Algorithms for Reinforcement Learning , 1999, J. Artif. Intell. Res..

[30]  Peter Sollich,et al.  Advances in neural information processing systems 11 , 1999 .

[31]  Yishay Mansour,et al.  Policy Gradient Methods for Reinforcement Learning with Function Approximation , 1999, NIPS.

[32]  Leslie Pack Kaelbling,et al.  Practical Reinforcement Learning in Continuous Spaces , 2000, ICML.

[33]  Michael I. Jordan,et al.  PEGASUS: A policy search method for large MDPs and POMDPs , 2000, UAI.

[34]  Huosheng Hu,et al.  KaBaGe-RL: Kanerva-based generalisation and reinforcement learning for possession football , 2001, Proceedings 2001 IEEE/RSJ International Conference on Intelligent Robots and Systems. Expanding the Societal Role of Robotics in the the Next Millennium (Cat. No.01CH37180).

[35]  Steven M. Gustafson,et al.  Genetic Programming And Multi-agent Layered Learning By Reinforcements , 2002, GECCO.

[36]  Christophe G. Giraud-Carrier,et al.  Unifying Learning with Evolution Through Baldwinian Evolution and Lamarckism , 2000, Advances in Computational Intelligence and Learning.

[37]  Risto Miikkulainen,et al.  Evolving Neural Networks through Augmenting Topologies , 2002, Evolutionary Computation.

[38]  Sandor Markon,et al.  Threshold selection, hypothesis tests, and DOE methods , 2002, Proceedings of the 2002 Congress on Evolutionary Computation. CEC'02 (Cat. No.02TH8600).

[39]  Risto Miikkulainen,et al.  Evolving adaptive neural networks with and without adaptive synapses , 2003, The 2003 Congress on Evolutionary Computation, 2003. CEC '03..

[40]  Jeffrey O. Kephart,et al.  The Vision of Autonomic Computing , 2003, Computer.

[41]  Risto Miikkulainen,et al.  Evolving Keepaway Soccer Players through Task Decomposition , 2003, GECCO.

[42]  Doina Precup,et al.  Combining TD-learning with Cascade-correlation Networks , 2003, ICML.

[43]  Risto Miikkulainen,et al.  Robust non-linear control through neuroevolution , 2003 .

[44]  L. D. Whitley Genetic reinforcement learning for neurocontrol problems , 1993, Machine Learning.

[45]  Keith L. Downing,et al.  Reinforced Genetic Programming , 2001, Genetic Programming and Evolvable Machines.

[46]  Peter Auer,et al.  Finite-time Analysis of the Multiarmed Bandit Problem , 2002, Machine Learning.

[47]  Richard S. Sutton,et al.  Reinforcement learning with replacing eligibility traces , 2004, Machine Learning.

[48]  Risto Miikkulainen,et al.  Competitive Coevolution through Evolutionary Complexification , 2011, J. Artif. Intell. Res..

[49]  Rajarshi Das,et al.  Utility functions in autonomic systems , 2004, International Conference on Autonomic Computing, 2004. Proceedings..

[50]  Peter Stone,et al.  Machine Learning for Fast Quadrupedal Locomotion , 2004, AAAI.

[51]  Andrew G. Barto,et al.  Elevator Group Control Using Multiple Reinforcement Learning Agents , 1998, Machine Learning.

[52]  Risto Miikkulainen,et al.  Evolving a Roving Eye for Go , 2004, GECCO.

[53]  Longxin Lin Self-Improving Reactive Agents Based on Reinforcement Learning, Planning and Teaching , 2004, Machine Learning.

[54]  Peter Stone,et al.  Function Approximation via Tile Coding: Automating Parameter Choice , 2005, SARA.

[55]  Sridhar Mahadevan,et al.  Samuel Meets Amarel: Automating Value Function Approximation Using Global State Space Analysis , 2005, AAAI.

[56]  Martin A. Riedmiller Neural Fitted Q Iteration - First Experiences with a Data Efficient Neural Reinforcement Learning Method , 2005, ECML.

[57]  Peter Stone,et al.  Reinforcement Learning for RoboCup Soccer Keepaway , 2005, Adapt. Behav..

[58]  Jean-Arcady Meyer,et al.  Adaptive Behavior , 2005 .

[59]  Risto Miikkulainen,et al.  Evolving Soccer Keepaway Players Through Task Decomposition , 2005, Machine Learning.

[60]  Richard S. Sutton,et al.  Reinforcement Learning: An Introduction , 1998, IEEE Trans. Neural Networks.

[61]  Richard S. Sutton,et al.  Learning to predict by the methods of temporal differences , 1988, Machine Learning.

[62]  Simon M. Lucas,et al.  Coevolution versus self-play temporal difference learning for acquiring position evaluation in small-board go , 2005, IEEE Transactions on Evolutionary Computation.

[63]  Peter Stone,et al.  Keepaway Soccer: From Machine Learning Testbed to Benchmark , 2005, RoboCup.

[64]  Jürgen Schmidhuber,et al.  Co-evolving recurrent neurons learn deep memory POMDPs , 2005, GECCO '05.

[65]  L. Buşoniu Evolutionary function approximation for reinforcement learning , 2006 .

[66]  Shimon Whiteson,et al.  Comparing evolutionary and temporal difference methods in a reinforcement learning domain , 2006, GECCO.

[67]  Shimon Whiteson,et al.  Sample-Efficient Evolutionary Function Approximation for Reinforcement Learning , 2006, AAAI.

[68]  T. Prescott,et al.  Introduction. Modelling natural action selection , 2007, Philosophical Transactions of the Royal Society B: Biological Sciences.