Generalized Markov Decision Processes: Dynamic-programming and Reinforcement-learning Algorithms

The problem of maximizing the expected total discounted reward in a completely observable Markovian environment, i.e., a Markov decision process (MDP), models a particular class of sequential decision problems. Algorithms have been developed for making optimal decisions in MDPs given either an MDP specification or the opportunity to interact with the MDP over time. Recently, other sequential decision-making problems have been studied, prompting the development of new algorithms and analyses. We describe a new generalized model that subsumes MDPs as well as many of the recent variations. We prove some basic results concerning this model and develop generalizations of value iteration, policy iteration, model-based reinforcement learning, and Q-learning that can be used to make optimal decisions in the generalized model under various assumptions. Applications of the theory to particular models are described, including risk-averse MDPs, exploration-sensitive MDPs, Sarsa, Q-learning with spreading, two-player games, and approximate max picking via sampling. Central to the results are the contraction property of the value operator and a stochastic-approximation theorem that reduces asynchronous convergence to synchronous convergence.
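To make the generalized value operator and its contraction-based iteration concrete, the following is a minimal sketch of generalized value iteration, assuming a small finite model given as reward and transition tables. The function and variable names (generalized_value_iteration, action_summary, next_state_summary, expected_next, worst_next, greedy) are illustrative choices, not taken from the paper; the point is only that swapping the two summary operators recovers different members of the generalized family, such as the standard expected-reward MDP or a risk-averse (worst-case) variant.

```python
# Minimal sketch of generalized value iteration (illustrative, not the paper's code).
# The Bellman-style backup is parameterized by two "summary" operators:
#   next_state_summary: reduces over next states (e.g. expectation, worst case)
#   action_summary:     reduces over actions     (e.g. max, min)
import numpy as np

def generalized_value_iteration(R, P, gamma, action_summary, next_state_summary,
                                tol=1e-8, max_iters=10_000):
    """Iterate the generalized value operator to (near) its fixed point.

    R: rewards, shape (S, A).
    P: transition probabilities, shape (S, A, S).
    gamma: discount factor in [0, 1).
    """
    S, A = R.shape
    V = np.zeros(S)
    for _ in range(max_iters):
        # Q[s, a] = R[s, a] + gamma * (summary over next states of V)
        Q = R + gamma * next_state_summary(P, V)
        V_new = action_summary(Q)
        # The value operator is a gamma-contraction in the sup norm, so
        # successive iterates approach the fixed point geometrically.
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
    return V

# Standard expected-reward MDP: expectation over next states, max over actions.
expected_next = lambda P, V: P @ V          # shape (S, A)
greedy = lambda Q: Q.max(axis=1)

# Risk-averse (minimax) variant: value of the worst reachable next state.
worst_next = lambda P, V: np.where(P > 0, V[None, None, :], np.inf).min(axis=2)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    S, A = 4, 2
    R = rng.random((S, A))
    P = rng.random((S, A, S))
    P /= P.sum(axis=2, keepdims=True)
    print(generalized_value_iteration(R, P, 0.9, greedy, expected_next))
    print(generalized_value_iteration(R, P, 0.9, greedy, worst_next))
```

Under the contraction assumption stated in the abstract, both instantiations converge to their respective optimal value functions; only the choice of summary operators changes.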
