Potential-based online policy iteration algorithms for Markov decision processes

Performance potentials play a crucial role in performance sensitivity analysis and policy iteration for Markov decision processes. The potentials can be estimated from a single sample path of a Markov process. In this paper, we propose two potential-based online policy iteration algorithms for performance optimization of Markov systems. The algorithms are based on online estimation of potentials and on stochastic approximation. We prove that, with these two algorithms, the optimal policy is attained after a finite number of iterations. A simulation example illustrates the main ideas and the convergence rates of the algorithms.
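To make the idea concrete, the following is a minimal sketch (not the paper's exact algorithms) of potential-based policy iteration: the potentials g are estimated from a single simulated sample path via accumulated reward deviations relative to a reference state, and the policy is then improved greedily with respect to g. The two-state, two-action model, the transition matrices P, and the rewards r below are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_path(P, policy, length, start=0):
    """Simulate one sample path of the chain under a stationary policy."""
    n_states = P.shape[2]
    path = np.empty(length, dtype=int)
    s = start
    for t in range(length):
        path[t] = s
        s = rng.choice(n_states, p=P[policy[s], s])
    return path

def estimate_potentials(path, r, policy, ref_state=0):
    """Estimate the average reward eta and the potentials g from one sample path.

    g(i) is approximated by the deviations r(X_t) - eta accumulated from the
    first visit to state i until the next return to the reference state,
    with g(ref_state) normalized to approximately zero.
    """
    rewards = np.array([r[policy[s], s] for s in path])
    eta = rewards.mean()                      # long-run average reward estimate
    n_states = r.shape[1]
    g = np.zeros(n_states)
    counts = np.zeros(n_states)
    acc = {}   # running deviation sums for states seen since the last reference visit
    for t, s in enumerate(path):
        if s == ref_state:
            for i, v in acc.items():          # flush completed accumulation segments
                g[i] += v
                counts[i] += 1
            acc = {}
        if s not in acc:
            acc[s] = 0.0
        for i in acc:
            acc[i] += rewards[t] - eta
    visited = counts > 0
    g[visited] /= counts[visited]
    return eta, g

def improve_policy(P, r, g):
    """One policy-improvement step: act greedily with respect to the estimated potentials."""
    n_actions, n_states = r.shape
    q = np.array([[r[a, i] + P[a, i] @ g for a in range(n_actions)]
                  for i in range(n_states)])
    return q.argmax(axis=1)

# Hypothetical two-state, two-action MDP used only for illustration.
P = np.array([[[0.7, 0.3], [0.4, 0.6]],    # transition matrix under action 0
              [[0.2, 0.8], [0.9, 0.1]]])   # transition matrix under action 1
r = np.array([[1.0, 0.0],                  # rewards r[action, state]
              [0.5, 2.0]])

policy = np.zeros(2, dtype=int)
for it in range(5):
    path = simulate_path(P, policy, length=20000)
    eta, g = estimate_potentials(path, r, policy)
    new_policy = improve_policy(P, r, g)
    if np.array_equal(new_policy, policy):   # policy has converged
        break
    policy = new_policy
print("estimated average reward:", eta, "policy:", policy)
```

In the paper's online algorithms the estimation and improvement steps are interleaved along the sample path using stochastic approximation, rather than separated into the distinct simulate/estimate/improve phases shown here.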
