Paper information - The Witness Algorithm: Solving Partially Observable Markov Decision Processes

Markov decision processes (MDP's) are a mathematical formalization of problems in which a decision-maker must choose how to act to maximize its reward over a series of interactions with its environment. Partially observable Markov decision processes (POMDP's) generalize the MDP framework to the case where the agent must make its decisions in partial ignorance of its current situation. This paper describes the POMDP framework and presents some well-known results from the field. It then presents a novel method called the witness algorithm for solving POMDP problems and analyzes its computational complexity. The paper argues that the witness algorithm is superior to existing algorithms for solving POMDP's in an important complexity-theoretic sense.

M. Littman
