Partially Observable Markov Decision Processes and Performance Sensitivity Analysis

The sensitivity-based optimization of Markov systems has become an increasingly important research area. From the perspective of performance sensitivity analysis, policy-iteration algorithms and gradient estimation methods can be obtained directly for Markov decision processes (MDPs). In this correspondence, sensitivity-based optimization is extended to average-reward partially observable MDPs (POMDPs). We derive the performance-difference and performance-derivative formulas for POMDPs. Based on the performance-derivative formula, we present a new method for estimating performance gradients. From the performance-difference formula, we obtain a sufficient optimality condition that does not rely on the discounted-reward formulation. We also propose a policy-iteration algorithm that yields a nearly optimal finite-state-controller policy.
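
As background, the correspondence builds on the sensitivity-based optimization framework for fully observable average-reward MDPs. The two central formulas of that framework are sketched below in standard notation (P and P' are the transition matrices of two policies, f and f' their reward vectors, pi and pi' their steady-state distributions, eta = pi f and eta' = pi' f' the average rewards, and g the performance-potential vector solving the Poisson equation (I - P + e pi) g = f). The POMDP counterparts derived in the paper have the same structure; the notation here is ours and only illustrates the MDP case:

\[
\eta' - \eta \;=\; \pi'\big[(P' - P)\,g + (f' - f)\big] \qquad \text{(performance difference)}
\]
\[
\left.\frac{d\eta_\delta}{d\delta}\right|_{\delta = 0} \;=\; \pi\big[(P' - P)\,g + (f' - f)\big],
\quad P_\delta = P + \delta (P' - P),\; f_\delta = f + \delta (f' - f) \qquad \text{(performance derivative)}
\]

The performance-difference formula immediately suggests a policy-improvement step: choosing, state by state, an action that increases f(i, a) + sum_j p(j | i, a) g(j) cannot decrease the average reward. The sketch below illustrates this potential-based policy iteration for the fully observable MDP case only; it is not the finite-state-controller algorithm of the paper, and the function names are hypothetical.

```python
import numpy as np

def stationary_distribution(P):
    """Solve pi P = pi, pi 1 = 1 for the steady-state distribution."""
    n = P.shape[0]
    A = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

def potentials(P, f, pi):
    """Performance potentials g from the Poisson equation (I - P + e pi) g = f."""
    n = P.shape[0]
    eta = pi @ f
    g = np.linalg.solve(np.eye(n) - P + np.outer(np.ones(n), pi), f)
    return g, eta

def policy_iteration(P_a, f_a, max_iter=100):
    """Potential-based policy iteration for a finite average-reward MDP.

    P_a[a] is the transition matrix under action a; f_a[a] is the reward vector.
    """
    n = P_a[0].shape[0]
    num_actions = len(P_a)
    policy = np.zeros(n, dtype=int)
    eta = -np.inf
    for _ in range(max_iter):
        # Transition matrix and reward vector of the current policy.
        P = np.array([P_a[policy[i]][i] for i in range(n)])
        f = np.array([f_a[policy[i]][i] for i in range(n)])
        pi = stationary_distribution(P)
        g, eta = potentials(P, f, pi)
        # Improvement step: per state, pick the action maximizing
        # f(i, a) + sum_j p(j | i, a) g(j); by the performance-difference
        # formula this never decreases the average reward.
        new_policy = np.array([
            int(np.argmax([f_a[a][i] + P_a[a][i] @ g for a in range(num_actions)]))
            for i in range(n)
        ])
        if np.array_equal(new_policy, policy):
            break
        policy = new_policy
    return policy, eta
```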
