Partially Observable Markov Decision Processes and Performance Sensitivity Analysis

The sensitivity-based optimization of Markov systems has become an increasingly important research area. From the perspective of performance sensitivity analysis, policy-iteration algorithms and gradient estimation methods can be obtained directly for Markov decision processes (MDPs). In this correspondence, sensitivity-based optimization is extended to average-reward partially observable MDPs (POMDPs). We derive the performance-difference and performance-derivative formulas for POMDPs. Based on the performance-derivative formula, we present a new method for estimating performance gradients. From the performance-difference formula, we obtain a sufficient optimality condition that does not rely on the discounted-reward formulation. We also propose a policy-iteration algorithm that yields a nearly optimal finite-state-controller policy.
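
As background, the correspondence builds on the sensitivity-based optimization framework for fully observable average-reward MDPs. The two central formulas of that framework are sketched below in standard notation (P and P' are the transition matrices of two policies, f and f' their reward vectors, pi and pi' their steady-state distributions, eta = pi f and eta' = pi' f' the average rewards, and g the performance-potential vector solving the Poisson equation (I - P + e pi) g = f). The POMDP counterparts derived in the paper have the same structure; the notation here is ours and only illustrates the MDP case:

\[
\eta' - \eta \;=\; \pi'\big[(P' - P)\,g + (f' - f)\big] \qquad \text{(performance difference)}
\]
\[
\left.\frac{d\eta_\delta}{d\delta}\right|_{\delta = 0} \;=\; \pi\big[(P' - P)\,g + (f' - f)\big],
\quad P_\delta = P + \delta (P' - P),\; f_\delta = f + \delta (f' - f) \qquad \text{(performance derivative)}
\]

The performance-difference formula immediately suggests a policy-improvement step: choosing, state by state, an action that increases f(i, a) + sum_j p(j | i, a) g(j) cannot decrease the average reward. The sketch below illustrates this potential-based policy iteration for the fully observable MDP case only; it is not the finite-state-controller algorithm of the paper, and the function names are hypothetical.

```python
import numpy as np

def stationary_distribution(P):
    """Solve pi P = pi, pi 1 = 1 for the steady-state distribution."""
    n = P.shape[0]
    A = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

def potentials(P, f, pi):
    """Performance potentials g from the Poisson equation (I - P + e pi) g = f."""
    n = P.shape[0]
    eta = pi @ f
    g = np.linalg.solve(np.eye(n) - P + np.outer(np.ones(n), pi), f)
    return g, eta

def policy_iteration(P_a, f_a, max_iter=100):
    """Potential-based policy iteration for a finite average-reward MDP.

    P_a[a] is the transition matrix under action a; f_a[a] is the reward vector.
    """
    n = P_a[0].shape[0]
    num_actions = len(P_a)
    policy = np.zeros(n, dtype=int)
    eta = -np.inf
    for _ in range(max_iter):
        # Transition matrix and reward vector of the current policy.
        P = np.array([P_a[policy[i]][i] for i in range(n)])
        f = np.array([f_a[policy[i]][i] for i in range(n)])
        pi = stationary_distribution(P)
        g, eta = potentials(P, f, pi)
        # Improvement step: per state, pick the action maximizing
        # f(i, a) + sum_j p(j | i, a) g(j); by the performance-difference
        # formula this never decreases the average reward.
        new_policy = np.array([
            int(np.argmax([f_a[a][i] + P_a[a][i] @ g for a in range(num_actions)]))
            for i in range(n)
        ])
        if np.array_equal(new_policy, policy):
            break
        policy = new_policy
    return policy, eta
```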
