Simulation-based optimization of Markov reward processes

We propose a simulation-based algorithm for optimizing the average reward in a Markov reward process that depends on a set of parameters. As a special case, the method applies to Markov decision processes where optimization takes place within a parametrized set of policies. The algorithm involves the simulation of a single sample path, and can be implemented online. A convergence result (with probability 1) is provided.

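To make the idea concrete, the sketch below shows a generic single-sample-path, likelihood-ratio (score-function) gradient method for the average reward of a parametrized Markov reward process. It is an illustrative approximation in the spirit of the abstract, not the paper's exact algorithm: the two-state toy problem (`P`, `R`), the softmax parametrization, the `policy`/`optimize` helpers, the choice of state 0 as the regeneration state, and the step sizes are all hypothetical choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy problem, for illustration only (not from the paper):
# two states, two actions.  P[a, s, s'] is the transition probability from
# s to s' under action a; R[s] is the reward earned in state s.
P = np.array([
    [[0.9, 0.1],    # transitions under action 0
     [0.2, 0.8]],
    [[0.5, 0.5],    # transitions under action 1
     [0.6, 0.4]],
])
R = np.array([0.0, 1.0])
N_STATES, N_ACTIONS = 2, 2
RECURRENT_STATE = 0  # regeneration state: updates are applied on each return here


def policy(theta, s):
    """Randomized policy pi(.|s): softmax over the row theta[s, :]."""
    z = theta[s] - theta[s].max()
    e = np.exp(z)
    return e / e.sum()


def optimize(n_steps=200_000, step_size=0.01, avg_step=0.01):
    """Single-sample-path likelihood-ratio gradient ascent on the average reward."""
    theta = np.zeros((N_STATES, N_ACTIONS))
    eta = 0.0                        # running estimate of the average reward
    z = np.zeros_like(theta)         # eligibility: sum of score functions in the cycle
    g = np.zeros_like(theta)         # gradient estimate accumulated over the cycle
    s = RECURRENT_STATE
    for _ in range(n_steps):
        r = R[s]
        g += (r - eta) * z           # past actions get credit for the current reward
        eta += avg_step * (r - eta)  # slower averaging of the reward

        probs = policy(theta, s)
        a = rng.choice(N_ACTIONS, p=probs)
        score = -probs               # d log pi(a|s) / d theta[s, :]
        score[a] += 1.0
        z[s] += score

        s = rng.choice(N_STATES, p=P[a, s])
        if s == RECURRENT_STATE:     # end of a regenerative cycle: apply the update
            theta += step_size * g
            z[:] = 0.0
            g[:] = 0.0
    return theta, eta


if __name__ == "__main__":
    theta, eta = optimize()
    print("estimated average reward:", eta)
    print("policy in state 0:", policy(theta, 0))
    print("policy in state 1:", policy(theta, 1))
```

Everything here runs on a single simulated trajectory and can be applied online: the average-reward estimate is refreshed every step, while the parameter vector is updated only when the chain re-enters the designated recurrent state, which is one common way to structure such updates.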