A Unified Analysis of Value-Function-Based Reinforcement-Learning Algorithms

Reinforcement learning is the problem of generating optimal behavior in a sequential decision-making environment given the opportunity to interact with it. Many algorithms for solving reinforcement-learning problems work by computing improved estimates of the optimal value function. We extend prior analyses of reinforcement-learning algorithms and present a new theorem that provides a unified analysis of such value-function-based algorithms. The usefulness of the theorem lies in how it allows the convergence of a complex asynchronous reinforcement-learning algorithm to be proved by verifying that a simpler synchronous algorithm converges. We illustrate the application of the theorem by analyzing the convergence of Q-learning, model-based reinforcement learning, Q-learning with multistate updates, Q-learning for Markov games, and risk-sensitive reinforcement learning.

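As a concrete illustration of the kind of algorithm the theorem covers, here is a minimal sketch of tabular Q-learning, the prototypical asynchronous value-function-based method named in the abstract. The environment interface (`env.reset()`, `env.step()`) and all hyperparameter values are illustrative assumptions, not details from the paper.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Q-learning: asynchronous estimation of the optimal value function.

    Assumes a Gym-style interface -- env.reset() -> state,
    env.step(a) -> (next_state, reward, done) -- which is an
    illustrative assumption, not part of the paper.
    """
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection for exploration
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # one-step temporal-difference update toward the Bellman target
            target = r + (0.0 if done else gamma * np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```

Each step revises only the single state-action entry just visited; it is precisely this asynchronous structure whose convergence the paper's theorem reduces to verifying convergence of the corresponding synchronous (full-sweep) update.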