Gradient-Based Optimization of Markov Reward Processes: Practical Variants

We consider a discrete-time, finite-state Markov reward process that depends on a set of parameters. In earlier work, we proposed a class of (stochastic) gradient descent methods that tune the parameters in order to optimize the average reward, using a single (possibly simulated) sample path of the process of interest. The resulting algorithms can be implemented online, and have the property that the gradient of the average reward converges to zero with probability 1. There is a drawback, however: the updates can have high variance, resulting in slow convergence. In this paper, we address this issue and propose two approaches that reduce the variance at the cost of introducing an additional bias into the update direction. We derive bounds for the resulting bias term and characterize the asymptotic behavior of the gradient of the average reward. For one of the approaches considered, the magnitude of the bias term exhibits an interesting dependence on the mixing time of the underlying Markov chain. We use a call admission control problem to illustrate the performance of one of the algorithms.
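
To make the flavor of such online algorithms concrete, the following is a minimal, generic sketch of a likelihood-ratio (score-function) gradient update on a simulated finite-state Markov reward process, with a forgetting factor in the eligibility trace as one common variance-reduction device. The chain, the softmax policy parameterization, and the constants `alpha` and `beta` are illustrative assumptions; this is not the specific algorithm or variance-reduction scheme analyzed in the paper, only an indication of how such updates can be applied along a single sample path. Pushing `beta` toward 1 lowers the bias of the gradient estimate but raises its variance, which is the kind of trade-off the bias bounds in the paper quantify.

```python
# Illustrative sketch only: a generic online likelihood-ratio gradient method
# for the average reward of a simulated finite-state Markov reward process.
# All specifics (the chain, the softmax parameterization, the forgetting
# factor `beta`) are assumptions for illustration, not the paper's algorithm.
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions = 4, 2
theta = np.zeros((n_states, n_actions))   # tunable parameters of a randomized policy
# Fixed transition kernels P[a][s, s'] and rewards r[s, a], chosen arbitrarily.
P = [rng.dirichlet(np.ones(n_states), size=n_states) for _ in range(n_actions)]
r = rng.uniform(0.0, 1.0, size=(n_states, n_actions))

def policy(s):
    """Softmax action probabilities at state s."""
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

alpha = 0.01      # step size
beta = 0.95       # forgetting factor in the eligibility trace:
                  # beta < 1 reduces variance but biases the gradient estimate
rho = 0.0         # running estimate of the average reward
trace = np.zeros_like(theta)

s = 0
for t in range(200_000):
    p = policy(s)
    a = rng.choice(n_actions, p=p)
    reward = r[s, a]

    # Score function: gradient of log-probability of the chosen action.
    grad_log = np.zeros_like(theta)
    grad_log[s] = -p
    grad_log[s, a] += 1.0

    # Discounted eligibility trace and online parameter / baseline updates.
    trace = beta * trace + grad_log
    theta += alpha * (reward - rho) * trace
    rho += 0.01 * (reward - rho)

    # Simulate the next state of the chain.
    s = rng.choice(n_states, p=P[a][s])

print("estimated average reward:", rho)
```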