A Generalized Reinforcement-Learning Model: Convergence and Applications

Reinforcement learning is the process by which an autonomous agent uses its experience interacting with an environment to improve its behavior. The Markov decision process (MDP) model is a popular way of formalizing the reinforcement-learning problem, but it is by no means the only way. In this paper, we show how many of the important theoretical results concerning reinforcement learning in MDPs extend to a generalized MDP model that includes MDPs, two-player games, and MDPs under a worst-case optimality criterion as special cases. The basis of this extension is a stochastic-approximation theorem that reduces asynchronous convergence to synchronous convergence.

Keywords: Reinforcement learning, Q-learning convergence, Markov games
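The generalization works by abstracting the two places where the ordinary Bellman recursion summarizes information: the expectation over next states and the maximization over actions. Roughly, the convergence results carry over whenever both summary operators are non-expansions and the discount factor is below one. The sketch below is a minimal Python illustration of this idea, not the paper's formal construction; the array shapes, function names, and the particular operator instances are assumptions chosen for concreteness (P is an S x A x S transition array with a reachable successor for every state-action pair, R an S x A reward array).

    import numpy as np

    def generalized_value_iteration(P, R, gamma, expand, summarize, iters=500):
        # Iterate the generalized Bellman recursion: `expand` summarizes
        # over next states (turning a value function into a Q function)
        # and `summarize` summarizes over actions (turning the Q function
        # back into a value function).
        V = np.zeros(P.shape[0])
        for _ in range(iters):
            Q = expand(P, R, gamma, V)   # value function -> Q function
            V = summarize(Q)             # Q function -> value function
        return V

    # Ordinary MDP: expectation over next states, max over actions.
    def expected_backup(P, R, gamma, V):
        # (P @ V)[s, a] = sum over s' of P[s, a, s'] * V[s']
        return R + gamma * (P @ V)

    # Worst-case optimality criterion: instead of the expectation, take
    # the minimum over next states reachable with positive probability
    # (risk-averse rather than risk-neutral).
    def worst_case_backup(P, R, gamma, V):
        reachable = np.where(P > 0, V[None, None, :], np.inf)
        return R + gamma * reachable.min(axis=2)

    greedy = lambda Q: Q.max(axis=1)     # action summary for a single agent

    # Hypothetical usage, with P and R built elsewhere:
    #   V_mdp   = generalized_value_iteration(P, R, 0.9, expected_backup, greedy)
    #   V_worst = generalized_value_iteration(P, R, 0.9, worst_case_backup, greedy)

Two-player zero-sum games fit the same template: the action-summary operator computes the minimax value of the one-stage matrix game over the players' joint actions instead of a plain maximum over one agent's actions.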
