Partially Observable Markov Decision Processes With Reward Information: Basic Ideas and Models
[1] S. Ross. Arbitrary State Markovian Decision Processes, 1968.
[2] T. Yoshikawa, et al. Discrete-Time Markovian Decision Processes with Incomplete State Observation, 1970.
[3] S. Ross. Quality Control under Markovian Deterioration, 1971.
[4] E. J. Sondik, et al. The Optimal Control of Partially Observable Markov Decision Processes, 1971.
[5] D. Rhenius. Incomplete Information in Markovian Decision Models, 1974.
[6] Robert C. Wang. Computing optimal quality control policies: two actions, 1976.
[7] Robert C. Wang, et al. Optimal Replacement Policy with Unobservable States, 1977.
[8] C. White. Optimal control-limit strategies for a partially observed replacement problem, 1979.
[9] C. White. Bounds on optimal cost for a replacement problem with partial observations, 1979.
[10] H. Mine, et al. An Optimal Inspection and Replacement Policy under Incomplete State Information: Average Cost Criterion, 1984.
[11] Hajime Kawai, et al. An optimal inspection and replacement policy under incomplete state information, 1986.
[12] J. Walrand, et al. Asymptotically efficient allocation rules for the multiarmed bandit problem with multiple plays, Part II: Markovian rewards, 1987.
[13] William S. Lovejoy, et al. Some Monotonicity Results for Partially Observed Markov Decision Processes, 1987, Oper. Res.
[14] W. Lovejoy. A survey of algorithmic methods for partially observed Markov decision processes, 1991.
[15] M. K. Ghosh, et al. Discrete-time controlled Markov processes with average cost criterion: a survey, 1993.
[16] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1994.
[17] W. Fleming. Book Review: Discrete-Time Markov Control Processes: Basic Optimality Criteria, 1997.
[18] O. Hernández-Lerma, et al. Further Topics on Discrete-Time Markov Control Processes, 1999.
[19] Vivek S. Borkar, et al. Average Cost Dynamic Programming Equations for Controlled Markov Chains with Partial Observations, 2000, SIAM J. Control Optim.
[20] Limiting discounted-cost control of partially observable stochastic systems, 2000, Proceedings of the 39th IEEE Conference on Decision and Control.
[21] Sanjeev R. Kulkarni, et al. Finite-time lower bounds for the two-armed bandit problem, 2000, IEEE Trans. Autom. Control.
[22] V. Borkar. Dynamic programming for ergodic control with partial observations, 2003.
[23] Xi-Ren Cao, et al. A unified approach to Markov decision problems and performance sensitivity analysis with discounted and average criteria: multichain cases, 2004, at - Automatisierungstechnik.
[24] Xi-Ren Cao, et al. Optimal Control of Ergodic Continuous-Time Markov Chains with Average Sample-Path Rewards, 2005, SIAM J. Control Optim.
[25] B. Nordstrom. Finite Markov Chains, 2005.
[26] Y. Ho, et al. Vector Ordinal Optimization, 2005.
[27] L. Platzman. Optimal Infinite-Horizon Undiscounted Control of Finite Probabilistic Systems, 2006.
[28] Yu-Chi Ho, et al. Constrained Ordinal Optimization: A Feasibility Model Based Approach, 2006, Discret. Event Dyn. Syst.