Evolutionary Function Approximation for Reinforcement Learning

Temporal difference (TD) methods are theoretically grounded and empirically effective for addressing reinforcement learning problems. In most real-world reinforcement learning tasks, TD methods require a function approximator to represent the value function. However, using function approximators requires manually making crucial representational decisions. This paper investigates evolutionary function approximation, a novel approach that automatically selects function approximator representations enabling efficient individual learning; in effect, it evolves individuals that are better able to learn. We present a fully implemented instantiation of evolutionary function approximation that combines NEAT, a neuroevolutionary optimization technique, with Q-learning, a popular TD method. The resulting NEAT+Q algorithm automatically discovers effective representations for neural network function approximators. This paper also presents on-line evolutionary computation, which improves the on-line performance of evolutionary computation by borrowing the selection mechanisms TD methods use to choose individual actions and applying them to select policies for evaluation. We evaluate these contributions with extended empirical studies in two domains: 1) the mountain car task, a standard reinforcement learning benchmark on which neural network function approximators have previously performed poorly, and 2) server job scheduling, a large probabilistic domain drawn from the field of autonomic computing. The results demonstrate that evolutionary function approximation can significantly improve the performance of TD methods and that on-line evolutionary computation can significantly improve evolutionary methods. This paper also presents additional tests that offer insight into what factors can make neural network function approximation difficult in practice.
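
The abstract names two concrete mechanisms that a small sketch may help make tangible: Q-learning updates applied to each individual's value-function approximator during its evaluation, and TD-style (here, epsilon-greedy) selection of which individual in the population to evaluate next. The Python below is a minimal sketch under stated assumptions, not the paper's implementation: a fixed-topology linear Q-network stands in for the network topologies NEAT would actually evolve, and `env` is a hypothetical environment exposing `reset() -> state` and `step(action) -> (next_state, reward, done)` with NumPy array states.

```python
import random
import numpy as np

# Hedged sketch only: a fixed-topology linear Q-network stands in for the
# topologies NEAT would evolve, and `env` is a hypothetical object with
# reset() -> state and step(action) -> (next_state, reward, done).

GAMMA, ALPHA, EPSILON = 0.99, 0.01, 0.1

class QNet:
    """Value-function approximator: one weight row per action."""
    def __init__(self, n_features, n_actions):
        self.w = np.random.randn(n_actions, n_features) * 0.1

    def q_values(self, state):
        return self.w @ state            # Q(s, a) for every action a

    def td_update(self, s, a, r, s_next, done):
        # Q-learning target: r + gamma * max_a' Q(s', a')
        target = r if done else r + GAMMA * self.q_values(s_next).max()
        error = target - self.q_values(s)[a]
        self.w[a] += ALPHA * error * s   # gradient step for a linear net

def run_episode(net, env):
    """One epsilon-greedy episode; the individual learns via TD as it acts."""
    s, total, done = env.reset(), 0.0, False
    while not done:
        if random.random() < EPSILON:
            a = random.randrange(net.w.shape[0])
        else:
            a = int(net.q_values(s).argmax())
        s_next, r, done = env.step(a)
        net.td_update(s, a, r, s_next, done)
        s, total = s_next, total + r
    return total

def online_generation(population, env, episodes=100):
    # On-line evolutionary computation: rather than giving every individual
    # an equal share of evaluation episodes, choose the next individual to
    # evaluate epsilon-greedily on its running average return.
    scores = [[run_episode(net, env)] for net in population]  # seed averages
    for _ in range(episodes - len(population)):
        if random.random() < EPSILON:
            i = random.randrange(len(population))
        else:
            i = max(range(len(population)), key=lambda j: np.mean(scores[j]))
        scores[i].append(run_episode(population[i], env))
    return scores  # fitness estimates for selection and reproduction
```

The epsilon-greedy choice in `online_generation` is only one of the TD-style selection mechanisms the abstract alludes to; a softmax over average returns would slot into the same place. The returned per-individual score lists would then feed whatever selection and reproduction scheme the evolutionary method uses.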
