Reinforcement Learning: A Survey

This paper surveys the field of reinforcement learning from a computer-science perspective. It is written to be accessible to researchers familiar with machine learning. Both the historical basis of the field and a broad selection of current work are summarized. Reinforcement learning is the problem faced by an agent that learns behavior through trial-and-error interactions with a dynamic environment. The work described here has a resemblance to work in psychology, but differs considerably in the details and in the use of the word "reinforcement." The paper discusses central issues of reinforcement learning, including trading off exploration and exploitation, establishing the foundations of the field via Markov decision theory, learning from delayed reinforcement, constructing empirical models to accelerate learning, making use of generalization and hierarchy, and coping with hidden state. It concludes with a survey of some implemented systems and an assessment of the practical utility of current methods for reinforcement learning.
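As a concrete illustration of two of the central issues named above, learning from delayed reinforcement and trading off exploration against exploitation, the sketch below shows tabular Q-learning with epsilon-greedy action selection. This is a minimal illustrative sketch, not the paper's own implementation; the environment interface (reset, step, actions) and all parameter values are assumptions made for this example.

```python
# Minimal sketch of tabular Q-learning with epsilon-greedy exploration.
# The environment interface and hyperparameters are illustrative assumptions.
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Assumes env exposes reset() -> state, step(action) -> (state, reward, done),
    and a list env.actions of discrete actions (hypothetical interface)."""
    Q = defaultdict(float)  # Q[(state, action)] -> estimated long-run return

    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Exploration/exploitation trade-off: with probability epsilon take
            # a random action, otherwise act greedily on the current estimates.
            if random.random() < epsilon:
                action = random.choice(env.actions)
            else:
                action = max(env.actions, key=lambda a: Q[(state, a)])

            next_state, reward, done = env.step(action)

            # Temporal-difference update: credit for delayed reward propagates
            # backward through the bootstrap target r + gamma * max_a' Q(s', a').
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in env.actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```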
