Partially Observable Markov Decision Processes With Reward Information: Basic Ideas and Models

In a partially observable Markov decision process (POMDP), if the reward can be observed at each step, then the observed reward history contains information about the unknown state. This information, in addition to that contained in the observation history, can be used to update the state probability distribution (the belief). The resulting policy is called a reward-information policy (RI-policy); an optimal RI-policy performs no worse than any standard optimal policy that depends only on the observation history. This observation leads to four different problem formulations for POMDPs, depending on whether the reward function is known and whether the reward at each step is observable. This exploratory work may attract attention to these interesting problems.
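To make the idea concrete, the following is a minimal sketch of a belief update that folds in the observed reward as an extra likelihood factor, assuming a finite POMDP in which the reward likelihood P(r | s, a) is known. The names T, O, and R_prob are illustrative conventions, not notation from the paper.

```python
import numpy as np

def belief_update_with_reward(b, a, o, r, T, O, R_prob):
    """One Bayesian belief update using both the observation o and the
    observed reward r (a sketch under the assumptions stated above).

    b       : current belief over states, shape (S,)
    T[a]    : transition matrix, T[a][s, s'] = P(s' | s, a)
    O[a]    : observation matrix, O[a][s', o] = P(o | s', a)
    R_prob  : reward likelihood, R_prob(r, s, a) = P(r | s, a)
    """
    S = b.shape[0]
    # Weight each prior state by how likely the observed reward is there.
    w = np.array([b[s] * R_prob(r, s, a) for s in range(S)])
    # Propagate the reweighted belief through the transition model.
    pred = w @ T[a]                    # pred[s'] = sum_s w[s] * T[a][s, s']
    # Weight by the observation likelihood and renormalize.
    new_b = pred * O[a][:, o]
    return new_b / new_b.sum()
```

The standard POMDP belief update is recovered by dropping the R_prob factor; keeping it conditions the posterior on strictly more data, which is why an optimal RI-policy can do no worse than an optimal policy based on the observation history alone.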
