A Unified Analysis of Value-Function-Based Reinforcement-Learning Algorithms

Reinforcement learning is the problem of generating optimal behavior in a sequential decision-making environment given the opportunity to interact with it. Many algorithms for solving reinforcement-learning problems work by computing improved estimates of the optimal value function. We extend prior analyses of reinforcement-learning algorithms and present a new theorem that provides a unified analysis of such value-function-based algorithms. The usefulness of the theorem lies in how it allows the convergence of a complex asynchronous reinforcement-learning algorithm to be proved by verifying that a simpler synchronous algorithm converges. We illustrate the application of the theorem by analyzing the convergence of Q-learning, model-based reinforcement learning, Q-learning with multistate updates, Q-learning for Markov games, and risk-sensitive reinforcement learning.

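As a concrete illustration of the kind of algorithm the theorem covers, here is a minimal sketch of tabular Q-learning, the prototypical asynchronous value-function-based method named in the abstract. The environment interface (`env.reset()`, `env.step()`) and all hyperparameter values are illustrative assumptions, not details from the paper.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Q-learning: asynchronous estimation of the optimal value function.

    Assumes a Gym-style interface -- env.reset() -> state,
    env.step(a) -> (next_state, reward, done) -- which is an
    illustrative assumption, not part of the paper.
    """
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection for exploration
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # one-step temporal-difference update toward the Bellman target
            target = r + (0.0 if done else gamma * np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```

Each step revises only the single state-action entry just visited; it is precisely this asynchronous structure whose convergence the paper's theorem reduces to verifying convergence of the corresponding synchronous (full-sweep) update.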