A Tutorial on Reinforcement Learning Techniques

Reinforcement Learning (RL) is learning through direct experimentation. It does not assume the existence of a teacher that provides examples upon which learning of a task takes place. Instead, in RL experience is the only teacher. With historical roots in the study of conditioned reflexes, RL soon attracted the interest of engineers and computer scientists because of its theoretical relevance and potential applications in fields as diverse as operational research and robotics. Computationally, RL is intended to operate in a learning environment composed of two subjects: the learner and a dynamic process. At successive time steps, the learner makes an observation of the process state, selects an action and applies it back to the process. The goal of the learner is to find an action policy that controls the behavior of this dynamic process, guided by signals (reinforcements) that indicate how well it is performing the required task. These signals are usually associated with some dramatic condition, e.g., accomplishment of a subtask (reward) or complete failure (punishment), and the learner's goal is to optimize its behavior based on some performance measure (a function of the received reinforcements). The crucial point is that, in order to do so, the learner must evaluate the conditions (associations between observed states and chosen actions) that lead to rewards or punishments. In other words, it must learn how to assign credit to past actions and states by correctly estimating the costs associated with these events. Starting from basic concepts, this tutorial presents the many flavors of RL algorithms, develops the corresponding mathematical tools, assesses their practical limitations, and discusses alternatives that have been proposed for applying RL to realistic tasks, such as those involving large state spaces or partial observability. It relies on examples and diagrams to illustrate the main points, and provides many references to the specialized literature and to Internet sites where relevant demos and additional information can be obtained.
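The interaction loop described above (observe a state, select an action, receive a reinforcement, update estimates of the cost or value of state-action pairs) can be made concrete with a minimal sketch. The code below is not taken from the tutorial; it is an illustrative tabular Q-learning agent on a hypothetical chain-shaped task, and all names and parameter values are assumptions chosen for the example.

```python
# A minimal sketch of the learner/process loop: tabular Q-learning on a toy
# chain of states. Parameters and the toy task are illustrative assumptions.
import random

N_STATES = 5          # states 0..4; reaching state 4 is the rewarding "dramatic condition"
ACTIONS = [-1, +1]    # move left or right along the chain
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1

# Estimated value (long-run reinforcement) of each state-action pair.
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(state, action):
    """Dynamic process: apply the action, return the next state and a reinforcement."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward

for episode in range(200):
    state = 0
    while state != N_STATES - 1:
        # Observe the state and select an action (epsilon-greedy exploration).
        if random.random() < EPSILON:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q[(state, a)])
        next_state, reward = step(state, action)
        # Credit assignment: adjust the estimate for the (state, action) pair
        # toward the received reinforcement plus the discounted future estimate.
        best_next = max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
        state = next_state

# Greedy policy learned from experience alone: the best action in each state.
print({s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES)})
```

This sketch only illustrates the two-subject setting (learner and dynamic process) and the credit-assignment update; the tutorial itself covers the full family of RL algorithms and their variants.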
