Solving Large POMDPs using Real Time Dynamic Programming

Partially Observable Markov Decision Processes (POMDPs) are general models of sequential decision problems in which both actions and observations may be noisy. Many problems of interest can be formulated as POMDPs, yet the use of POMDPs has been limited by the lack of effective algorithms: optimal algorithms do not scale up, and heuristic algorithms often do poorly. In this paper, a new POMDP algorithm is introduced that combines the benefits of optimal and heuristic procedures, producing good solutions quickly even in large problems. Like optimal procedures, the procedure RTDP-BEL attempts to solve the information MDP, yet like heuristic procedures, it makes decisions in real time by following a suitable heuristic function. RTDP-BEL is a Real Time Dynamic Programming algorithm [11], namely, a greedy search algorithm that learns to solve MDPs by repeatedly updating the heuristic values of the states that are visited. As shown by Barto et al., such updates eventually deliver optimal behavior provided that the state space is finite and the initial heuristic values are admissible. Since information MDPs have infinite state spaces, we discretize probabilities and combine them with heuristic values obtained from the underlying MDP. Although the resulting algorithm is not guaranteed to be optimal, experiments over a number of benchmarks suggest that large POMDPs are quickly and consistently solved, and that solutions, if not optimal, tend to be very good.
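
As a concrete illustration of the procedure described above, the following is a minimal Python sketch of an RTDP-BEL-style trial loop: beliefs are discretized so they can be stored in a hash table, unseen beliefs are initialized with a heuristic computed from the underlying MDP's value function, and each trial greedily selects actions while performing Bellman updates on the beliefs it visits. The POMDP encoding (transition table P[a][s][s'], observation table O[a][s'][o], cost table C[a][s]) and all function names here are assumptions for illustration, not the paper's implementation.

```python
import random

def discretize(b, levels=20):
    """Round each belief component so beliefs can be hashed and tabulated."""
    return tuple(round(p * levels) for p in b)

def belief_update(b, a, o, P, O):
    """Bayes filter: b'(s') is proportional to O[a][s'][o] * sum_s P[a][s][s'] * b(s)."""
    n = len(b)
    nb = [O[a][s2][o] * sum(P[a][s][s2] * b[s] for s in range(n)) for s2 in range(n)]
    z = sum(nb)
    return [p / z for p in nb] if z > 0 else list(b)

def rtdp_bel(P, O, C, mdp_V, actions, observations, b0,
             trials=100, horizon=100, levels=20, seed=0):
    """Illustrative RTDP-BEL over the belief MDP (costs are minimized)."""
    rng = random.Random(seed)
    n = len(b0)
    V = {}  # discretized belief -> current value estimate

    def value(b):
        key = discretize(b, levels)
        if key not in V:
            # Heuristic initialization from the underlying MDP's value function.
            V[key] = sum(p * mdp_V[s] for s, p in enumerate(b))
        return V[key]

    def q(b, a):
        # Expected immediate cost plus expected value of the successor beliefs.
        total = sum(b[s] * C[a][s] for s in range(n))
        for o in observations:
            p_o = sum(b[s] * P[a][s][s2] * O[a][s2][o]
                      for s in range(n) for s2 in range(n))
            if p_o > 0:
                total += p_o * value(belief_update(b, a, o, P, O))
        return total

    for _ in range(trials):
        b = list(b0)
        for _ in range(horizon):
            # Greedy action selection followed by a Bellman update on the
            # visited (discretized) belief -- the core real-time DP step.
            best_a = min(actions, key=lambda a: q(b, a))
            V[discretize(b, levels)] = q(b, best_a)
            # Sample an observation according to its probability and move on.
            probs = [sum(b[s] * P[best_a][s][s2] * O[best_a][s2][o]
                         for s in range(n) for s2 in range(n)) for o in observations]
            o = rng.choices(observations, weights=probs, k=1)[0]
            b = belief_update(b, best_a, o, P, O)
    return V
```

The sketch omits goal detection and discounting for brevity; in practice, trials would terminate when the belief concentrates on goal states, as in the RTDP formulation of Barto et al. [11].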

[1] Edward J. Sondik, et al. The Optimal Control of Partially Observable Markov Processes over a Finite Horizon, 1973, Oper. Res.

[2] Mark S. Boddy, et al. An Analysis of Time-Dependent Planning, 1988, AAAI.

[3] C. Watkins. Learning from Delayed Rewards, 1989.

[4] Richard E. Korf, et al. Real-Time Heuristic Search, 1990, Artif. Intell.

[5] Andrew McCallum, et al. Overcoming Incomplete Perception with Utile Distinction Memory, 1993, ICML.

[6] Leslie Pack Kaelbling, et al. Acting Optimally in Partially Observable Stochastic Domains, 1994, AAAI.

[7] Martin L. Puterman, et al. Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1994.

[8] Peter Norvig, et al. Artificial Intelligence: A Modern Approach, 1995.

[9] Stuart J. Russell, et al. Approximating Optimal Policies for Partially Observable Stochastic Domains, 1995, IJCAI.

[10] Leslie Pack Kaelbling, et al. Learning Policies for Partially Observable Environments: Scaling Up, 1997, ICML.

[11] Andrew G. Barto, et al. Learning to Act Using Real-Time Dynamic Programming, 1995, Artif. Intell.

[12] John N. Tsitsiklis, et al. Neuro-Dynamic Programming, 1996.

[13] Marco Wiering, et al. HQ-Learning: Discovering Markovian Subgoals for Non-Markovian Reinforcement Learning, 1996.

[14] Blai Bonet, et al. A Robust and Fast Action Selection Mechanism for Planning, 1997, AAAI/IAAI.

[15] Blai Bonet. High-Level Planning and Control with Incomplete Information Using POMDP's, 1998.