Markov Decision Processes

The theory of Markov Decision Processes is the theory of controlled Markov chains. Its origins can be traced back to R. Bellman and L. Shapley in the 1950s. Over the following decades the theory has grown dramatically and has found applications in areas such as computer science, engineering, operations research, biology, and economics. In this article we give a short introduction to parts of this theory. We treat Markov Decision Processes with finite and infinite time horizons, restricting the presentation to the so-called (generalized) negative case. Solution algorithms such as Howard's policy improvement and linear programming are also explained. Various examples illustrate the application of the theory, including stochastic linear-quadratic control problems, bandit problems and dividend pay-out problems.
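As a minimal sketch of how Howard's policy improvement operates on a finite Markov Decision Process, the following Python snippet alternates exact policy evaluation with greedy policy improvement until the policy is stable. The problem data (three states, two actions, random transition probabilities and rewards) and the discounted reward criterion are illustrative assumptions only; they are not taken from the article, which restricts attention to the (generalized) negative case.

```python
import numpy as np

# Illustrative-only problem data: a small finite MDP with hypothetical
# transition probabilities P[s, a, s'], rewards r[s, a], and discount
# factor beta (not the article's setting, just a simple discounted example).
n_states, n_actions, beta = 3, 2, 0.9
rng = np.random.default_rng(seed=0)

P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)        # rows become probability distributions
r = rng.random((n_states, n_actions))

policy = np.zeros(n_states, dtype=int)   # start from an arbitrary policy
while True:
    # Policy evaluation: solve (I - beta * P_pi) v = r_pi for the current policy.
    P_pi = P[np.arange(n_states), policy]
    r_pi = r[np.arange(n_states), policy]
    v = np.linalg.solve(np.eye(n_states) - beta * P_pi, r_pi)

    # Policy improvement: act greedily with respect to v in every state.
    q = r + beta * (P @ v)               # Q-values, shape (n_states, n_actions)
    new_policy = q.argmax(axis=1)

    if np.array_equal(new_policy, policy):
        break                            # policy is stable, hence optimal
    policy = new_policy

print("optimal policy:", policy)
print("value function:", v)
```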

[1]  L. Shapley  Stochastic Games, 1953, Proceedings of the National Academy of Sciences.

[2]  R. Bellman  The theory of dynamic programming, 1954.

[3]  Ronald A. Howard  Dynamic Programming and Markov Processes, 1960.

[4]  K. Miyasawa AN ECONOMIC SURVIVAL GAME , 1961 .

[5]  D. Blackwell Discounted Dynamic Programming , 1965 .

[6]  Onésimo Hernández-Lerma,et al.  Controlled Markov Processes , 1965 .

[7]  L. E. Dubins, L. J. Savage  How to Gamble if You Must: Inequalities for Stochastic Processes, 1965, McGraw-Hill. (Reviewed by D. Lindley, 1966, The Mathematical Gazette.)

[8]  K. Hinderer,et al.  Foundations of Non-stationary Dynamic Programming with Discrete Time Parameter , 1970 .

[9]  W. K. Hastings,et al.  Monte Carlo Sampling Methods Using Markov Chains and Their Applications , 1970 .

[10]  Edward J. Sondik,et al.  The Optimal Control of Partially Observable Markov Processes over a Finite Horizon , 1973, Oper. Res..

[11]  J. K. Satia,et al.  Markovian Decision Processes with Uncertain Transition Probabilities , 1973, Oper. Res..

[12]  P. Moerbeke On optimal stopping and free boundary problems , 1973, Advances in Applied Probability.

[13]  Moshe Ben-Horim,et al.  A linear programming approach , 1977 .

[14]  Richard Grinold,et al.  Finite horizon approximations of infinite horizon linear programs , 1977, Math. Program..

[15]  Evan L. Porteus Conditions for characterizing the structure of optimal strategies in infinite-horizon dynamic programs , 1982 .

[16]  Richard S. Sutton,et al.  Neuronlike adaptive elements that can solve difficult learning control problems , 1983, IEEE Transactions on Systems, Man, and Cybernetics.

[17]  Paul Bratley,et al.  A guide to simulation , 1983 .

[18]  John Rust Structural estimation of markov decision processes , 1986 .

[19]  L. Devroye Non-Uniform Random Variate Generation , 1986 .

[20]  J. Lasserre,et al.  An on-line procedure in discounted infinite-horizon stochastic optimal control , 1986 .

[21]  E. Gilbert,et al.  Optimal infinite-horizon feedback laws for a general class of constrained discrete-time systems: Stability and moving-horizon approximations , 1988 .

[22]  M. K rn,et al.  Stochastic Optimal Control , 1988 .

[23]  O. Hernández-Lerma,et al.  A forecast horizon and a stopping rule for general Markov decision processes , 1988 .

[24]  D. Yao,et al.  Stochastic monotonicity in general queueing networks , 1989, Journal of Applied Probability.

[25]  D. Bertsekas,et al.  Adaptive aggregation methods for infinite horizon dynamic programming , 1989 .

[26]  J. C. Gittins  Multi-armed Bandit Allocation Indices, 1989.

[27]  O. Hernández-Lerma Adaptive Markov Control Processes , 1989 .

[28]  O. Hernández-Lerma,et al.  Error bounds for rolling horizon policies in discrete-time Markov control processes , 1990 .

[29]  D. Mayne,et al.  Receding horizon control of nonlinear systems , 1990 .

[30]  J. Bather,et al.  Multi‐Armed Bandit Allocation Indices , 1990 .

[31]  Steven I. Marcus,et al.  On the computation of the optimal cost function for discrete time Markov models with partial observations , 1991, Ann. Oper. Res..

[32]  Ari Arapostathis,et al.  On the average cost optimality equation and the structure of optimal policies for partially observable Markov decision processes , 1991, Ann. Oper. Res..

[33]  Harald Niederreiter,et al.  Random number generation and Quasi-Monte Carlo methods , 1992, CBMS-NSF regional conference series in applied mathematics.

[34]  R. Weber On the Gittins Index for Multiarmed Bandits , 1992 .

[35]  Shaler Stidham,et al.  A survey of Markov decision models for control of networks of queues , 1993, Queueing Syst. Theory Appl..

[36]  V. Borkar White-noise representations in stochastic realization theory , 1993 .

[37]  M. K. Ghosh,et al.  Discrete-time controlled Markov processes with average cost criterion: a survey , 1993 .

[38]  Anders Martin-Löf,et al.  Lectures on the use of control theory in insurance , 1994 .

[39]  Chelsea C. White,et al.  Markov Decision Processes with Imprecise Transition Probabilities , 1994, Oper. Res..

[40]  Martin L. Puterman,et al.  Markov Decision Processes: Discrete Stochastic Dynamic Programming , 1994 .

[41]  John Rust Using Randomization to Break the Curse of Dimensionality , 1997 .

[42]  Leslie Pack Kaelbling,et al.  On the Complexity of Solving Markov Decision Problems , 1995, UAI.

[43]  Dimitri P. Bertsekas,et al.  Dynamic Programming and Optimal Control, Two Volume Set , 1995 .

[44]  V. Rykov,et al.  Controlled Queueing Systems , 1995 .

[45]  Andrew G. Barto,et al.  Learning to Act Using Real-Time Dynamic Programming , 1995, Artif. Intell..

[46]  Andrew W. Moore,et al.  Reinforcement Learning: A Survey , 1996, J. Artif. Intell. Res..

[47]  John N. Tsitsiklis,et al.  Neuro-Dynamic Programming , 1996, Encyclopedia of Machine Learning.

[48]  Awi Federgruen,et al.  Detection of minimal forecast horizons in dynamic programs with multiple indicators of the future , 1996 .

[50]  W. N. Patten,et al.  A sliding horizon feedback control problem with feedforward and disturbance , 1997 .

[51]  E. Altman,et al.  On submodular value functions and complex dynamic programming , 1998 .

[52]  Masanori Hosaka,et al.  CONTROLLED MARKOV SET-CHAINS WITH DISCOUNTING , 1998 .

[53]  L. Sennott Stochastic Dynamic Programming and the Control of Queueing Systems , 1998 .

[54]  Steven I. Marcus,et al.  Simulation-Based Algorithms for Average Cost Markov Decision Processes , 1999 .

[55]  O. Hernández-Lerma,et al.  Discrete-time Markov control processes , 1999 .

[56]  E. Altman Constrained Markov Decision Processes , 1999 .

[57]  Jay H. Lee,et al.  Model predictive control: past, present and future , 1999 .

[58]  Craig Boutilier,et al.  Decision-Theoretic Planning: Structural Assumptions and Computational Leverage , 1999, J. Artif. Intell. Res..

[59]  Thomas Parisini,et al.  Neural approximators and team theory for dynamic routing: a receding-horizon approach , 1999, Proceedings of the 38th IEEE Conference on Decision and Control (Cat. No.99CH36304).

[60]  Daniel Hernández-Hernández,et al.  Risk sensitive control of finite state Markov chains in discrete time, with applications to portfolio management , 1999, Math. Methods Oper. Res..

[62]  Daphne Koller,et al.  Policy Iteration for Factored MDPs , 2000, UAI.

[63]  John N. Tsitsiklis,et al.  A survey of computational complexity results in systems and control , 2000, Autom..

[64]  H. Kushner Numerical Methods for Stochastic Control Problems in Continuous Time , 2000 .

[65]  Robert Givan,et al.  Bounded-parameter Markov decision processes , 2000, Artif. Intell..

[66]  S. Marcus,et al.  A Simulation-Based Policy Iteration Algorithm for Average Cost Unichain Markov Decision Processes , 2000 .

[67]  Michael C. Fu,et al.  Monotone Optimal Policies for a Transient Queueing Staffing Problem , 2000, Oper. Res..

[68]  Renaud Lecoeuche Learning Optimal Dialogue Management Rules by Using Reinforcement Learning and Inductive Logic Programming , 2001, NAACL.

[69]  Kurt Driessens,et al.  Speeding Up Relational Reinforcement Learning through the Use of an Incremental First Order Decision Tree Learner , 2001, ECML.

[70]  John N. Tsitsiklis,et al.  Simulation-based optimization of Markov reward processes , 2001, IEEE Trans. Autom. Control..

[71]  Craig Boutilier,et al.  Symbolic Dynamic Programming for First-Order MDPs , 2001, IJCAI.

[72]  Shie Mannor,et al.  PAC Bounds for Multi-armed Bandit and Markov Decision Processes , 2002, COLT.

[73]  Vivek S. Borkar,et al.  Convex Analytic Methods in Markov Decision Processes , 2002 .

[74]  Benjamin Van Roy Neuro-Dynamic Programming: Overview and Recent Trends , 2002 .

[75]  Suresh P. Sethi,et al.  Forecast, Solution, and Rolling Horizons in Operations Management Problems: A Classified Bibliography , 2001, Manuf. Serv. Oper. Manag..

[76]  Eugene A. Feinberg,et al.  Handbook of Markov Decision Processes , 2002 .

[77]  W. A. van den Broek Moving horizon control in dynamic games , 2002 .

[78]  James E. Smith,et al.  Structural Properties of Stochastic Dynamic Programs , 2002, Oper. Res..

[79]  Sean P. Meyn,et al.  Risk-Sensitive Optimal Control for Markov Decision Processes with Monotone Cost , 2002, Math. Oper. Res..

[80]  Robert Givan,et al.  Inductive Policy Selection for First-Order MDPs , 2002, UAI.

[81]  Carlos Guestrin,et al.  Generalizing plans to new environments in relational MDPs , 2003, IJCAI 2003.

[82]  John N. Tsitsiklis,et al.  Approximate Gradient Methods in Policy-Space Optimization of Markov Reward Processes , 2003, Discret. Event Dyn. Syst..

[83]  Robert Givan,et al.  Approximate Policy Iteration with a Policy Language Bias , 2003, NIPS.

[84]  H. Föllmer,et al.  American Options, Multi–armed Bandits, and Optimal Consumption Plans: A Unifying View , 2003 .

[85]  Abhijit Gosavi,et al.  Simulation-Based Optimization: Parametric Optimization Techniques and Reinforcement Learning , 2003 .

[86]  Kurt Driessens,et al.  Relational Instance Based Regression for Relational Reinforcement Learning , 2003, ICML.

[87]  Benjamin Van Roy,et al.  The Linear Programming Approach to Approximate Dynamic Programming , 2003, Oper. Res..

[88]  Luc De Raedt,et al.  Logical Markov Decision Programs , 2003 .

[89]  Henk C. Tijms  A First Course in Stochastic Models, 2003.

[90]  Vijay R. Konda,et al.  On Actor-Critic Algorithms, 2003, SIAM J. Control. Optim.

[91]  L. Kallenberg Finite State and Action MDPS , 2003 .

[92]  William L. Cooper,et al.  CONVERGENCE OF SIMULATION-BASED POLICY ITERATION , 2003, Probability in the Engineering and Informational Sciences.

[94]  Martin L. Puterman,et al.  Coffee, Tea, or ...?: A Markov Decision Process Model for Airline Meal Provisioning , 2004, Transp. Sci..

[95]  Thomas G. Dietterich,et al.  Explanation-Based Learning and Reinforcement Learning: A Unified View , 1995, Machine Learning.

[96]  Ness B. Shroff,et al.  Markov decision processes with uncertain transition rates: sensitivity and max-min control, 2004.

[97]  Manfred Schäl,et al.  On Discrete-Time Dynamic Programming in Insurance: Exponential Utility and Minimizing the Ruin Probability , 2004 .

[98]  Haitao Fang,et al.  Potential-based online policy iteration algorithms for Markov decision processes , 2004, IEEE Trans. Autom. Control..

[99]  Ronald J. Williams,et al.  Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning , 2004, Machine Learning.

[100]  John N. Tsitsiklis,et al.  Asynchronous Stochastic Approximation and Q-Learning , 1994, Machine Learning.

[101]  Peter Dayan,et al.  Technical Note: Q-Learning , 2004, Machine Learning.

[102]  M. van Otterlo Reinforcement Learning for Relational MDPs , 2004 .

[103]  Dimitri P. Bertsekas,et al.  Dynamic Programming and Suboptimal Control: A Survey from ADP to MPC , 2005, Eur. J. Control.

[104]  Laurent El Ghaoui,et al.  Robust Control of Markov Decision Processes with Uncertain Transition Matrices , 2005, Oper. Res..

[105]  Richard S. Sutton,et al.  Reinforcement Learning: An Introduction , 1998, IEEE Trans. Neural Networks.

[106]  Casey A. Volino,et al.  A First Course in Stochastic Models , 2005, Technometrics.

[107]  A. Willsky,et al.  Importance sampling actor-critic algorithms , 2006, 2006 American Control Conference.

[108]  Thomas Gärtner,et al.  Graph kernels and Gaussian processes for relational reinforcement learning , 2006, Machine Learning.

[109]  Alʹbert Nikolaevich Shiri︠a︡ev,et al.  Optimal Stopping and Free-Boundary Problems , 2006 .

[110]  Pravin Varaiya,et al.  Simulation-based Uniform Value Function Estimates of Markov Decision Processes , 2006, SIAM J. Control. Optim..

[111]  Sean P. Meyn Control Techniques for Complex Networks: Workload , 2007 .

[113]  Michel Denuit,et al.  Association and heterogeneity of insured lifetimes in the Lee–Carter framework , 2007 .

[114]  Edwin K. P. Chong,et al.  Solving Controlled Markov Set-Chains With Discounting via Multipolicy Improvement , 2007, IEEE Transactions on Automatic Control.

[115]  Dimitri P. Bertsekas,et al.  Stochastic optimal control : the discrete time case , 2007 .

[116]  Alfredo García,et al.  A Decentralized Approach to Discrete Optimization via Simulation: Application to Network Flow , 2007, Oper. Res..

[117]  Warren B. Powell,et al.  Approximate Dynamic Programming: Solving the Curses of Dimensionality (Wiley Series in Probability and Statistics) , 2007 .

[118]  Hyeong Soo Chang,et al.  Finite-Step Approximation Error Bounds for Solving Average-Reward-Controlled Markov Set-Chains , 2008, IEEE Transactions on Automatic Control.

[119]  Alfredo García,et al.  A Game-Theoretic Approach to Efficient Power Management in Sensor Networks , 2008, Oper. Res..

[120]  Xianping Guo,et al.  Continuous-Time Markov Decision Processes: Theory and Applications , 2009 .

[121]  Hyeong Soo Chang Decentralized Learning in Finite Markov Chains: Revisited , 2009, IEEE Transactions on Automatic Control.

[122]  Michael Taksar,et al.  Stochastic Control in Insurance , 2010 .

[123]  Warren B. Powell,et al.  Optimal control of dosage decisions in controlled ovarian hyperstimulation , 2010, Ann. Oper. Res..

[124]  Warren B. Powell,et al.  A dynamic model for the failure replacement of aging high-voltage transformers , 2010 .

[125]  Dimitri P. Bertsekas,et al.  Approximate Dynamic Programming , 2017, Encyclopedia of Machine Learning and Data Mining.

[126]  P. Schrimpf,et al.  Dynamic Programming , 2011 .

[127]  U. Rieder,et al.  Markov Decision Processes with Applications to Finance , 2011 .

[128]  L. E. Dubins, L. J. Savage  How to Gamble if You Must: Inequalities for Stochastic Processes, 2012.

[129]  L. De Raedt,et al.  Relational Reinforcement Learning, 2022.