A potential-based method for finite-stage Markov Decision Process

The finite-stage Markov decision process (MDP) provides a general framework for many practical problems in which only the performance over a finite horizon is of interest. Dynamic programming (DP) offers a general way to find optimal policies, but it is often computationally infeasible because the policy space grows exponentially with the number of stages. Approximating the finite-stage MDP by an infinite-stage MDP reduces the search space, but because of the approximation error it usually fails to find the optimal stationary policy. We develop a method that finds optimal stationary policies for the finite-stage MDP directly. The method is based on performance potentials, which can be estimated from sample paths; it is therefore well suited to practical applications.

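To make the sample-path idea concrete, the sketch below estimates performance potentials for a small ergodic Markov chain by Monte Carlo. It is not the paper's algorithm; the two-state transition matrix, reward values, truncation length, and path length are illustrative assumptions, and the estimator used is the standard truncated-sum form g(i) ≈ E[Σ_{t=0}^{L-1} (f(X_t) − η) | X_0 = i], with η the estimated long-run average reward.

```python
import numpy as np

# Minimal sketch (illustrative, not the paper's method): estimate performance
# potentials g(i) of an ergodic Markov chain from a single simulated sample path.
# P, f, L, and T below are assumed values chosen only for demonstration.

P = np.array([[0.7, 0.3],
              [0.4, 0.6]])        # assumed transition matrix
f = np.array([1.0, 3.0])          # assumed per-stage rewards
L = 50                            # truncation length of the potential sum
T = 100_000                       # length of the simulated sample path
rng = np.random.default_rng(0)

# Simulate one long sample path X_0, X_1, ..., X_{T-1}.
path = np.empty(T, dtype=int)
path[0] = 0
for t in range(1, T):
    path[t] = rng.choice(2, p=P[path[t - 1]])

rewards = f[path]
eta = rewards.mean()              # sample-path estimate of the average reward

# For each visit to state i, accumulate the truncated sum of centered rewards;
# averaging these sums over visits gives the potential estimate g(i)
# (potentials are defined only up to an additive constant).
sums = np.zeros(len(f))
counts = np.zeros(len(f))
for start in range(T - L):
    i = path[start]
    sums[i] += rewards[start:start + L].sum() - L * eta
    counts[i] += 1

g = sums / counts
print("estimated average reward:", eta)
print("estimated potentials:", g)
```

Because potentials are only determined up to a constant, only the differences g(i) − g(j) matter when comparing policies, which is why the centering by η above is sufficient.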