Stochastic Local Search for POMDP Controllers

The search for finite-state controllers for partially observable Markov decision processes (POMDPs) is often based on approaches like gradient ascent, which are attractive because of their relatively low computational cost. In this paper, we illustrate a basic problem with gradient-based methods applied to POMDPs, one rooted in the sequential nature of the decision problem, and propose a new stochastic local search method as an alternative. The heuristics used in our procedure mimic the sequential reasoning inherent in optimal dynamic programming (DP) approaches. We show that our algorithm consistently finds higher-quality controllers than gradient ascent, and that it is competitive with (and, for some problems, superior to) other state-of-the-art controller-based and DP-based algorithms on large-scale POMDPs.
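To make the general idea concrete, the sketch below shows stochastic local search over deterministic finite-state controllers for a randomly generated toy POMDP. This is an illustrative assumption throughout, not the paper's algorithm: the controller is a pair (act, succ) mapping nodes to actions and (node, observation) pairs to successor nodes; evaluate_fsc solves the standard linear system for the controller's value over (node, state) pairs; and sls hill-climbs on single-parameter flips with occasional noise moves to escape local optima. The paper's own heuristics mimic DP reasoning and are more structured than this simple acceptance rule; all names and parameters here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy POMDP: S states, A actions, Z observations, discount gamma.
S, A, Z, gamma = 4, 3, 2, 0.95
T = rng.dirichlet(np.ones(S), size=(S, A))   # T[s, a, s'] = P(s' | s, a)
O = rng.dirichlet(np.ones(Z), size=(A, S))   # O[a, s', z] = P(z | a, s')
R = rng.uniform(-1.0, 1.0, size=(S, A))      # R[s, a]
b0 = np.full(S, 1.0 / S)                     # initial belief

N = 5  # number of controller nodes

def evaluate_fsc(act, succ):
    """Policy evaluation for a deterministic finite-state controller.

    Solves the linear system
        V[n,s] = R[s, act[n]] + gamma * sum_{s',z}
                 T[s, act[n], s'] * O[act[n], s', z] * V[succ[n,z], s']
    for the joint (controller node, world state) value function.
    """
    M = np.eye(N * S)
    r = np.empty(N * S)
    for n in range(N):
        a = act[n]
        for s in range(S):
            i = n * S + s
            r[i] = R[s, a]
            for s2 in range(S):
                for z in range(Z):
                    M[i, succ[n, z] * S + s2] -= gamma * T[s, a, s2] * O[a, s2, z]
    return np.linalg.solve(M, r).reshape(N, S)

def sls(iters=2000, noise=0.1):
    """Hill-climbing over controller parameters with occasional noise
    moves (a generic SLS scheme, assumed here for illustration)."""
    act = rng.integers(A, size=N)        # node -> action
    succ = rng.integers(N, size=(N, Z))  # (node, observation) -> next node
    cur = b0 @ evaluate_fsc(act, succ)[0]
    best_val, best = cur, (act.copy(), succ.copy())
    for _ in range(iters):
        a2, s2 = act.copy(), succ.copy()
        if rng.random() < 0.5:           # flip one node's action ...
            a2[rng.integers(N)] = rng.integers(A)
        else:                            # ... or rewire one observation edge
            s2[rng.integers(N), rng.integers(Z)] = rng.integers(N)
        val = b0 @ evaluate_fsc(a2, s2)[0]
        if val > cur or rng.random() < noise:
            act, succ, cur = a2, s2, val
            if cur > best_val:
                best_val, best = cur, (act.copy(), succ.copy())
    return best, best_val

if __name__ == "__main__":
    _, value = sls()
    print(f"best controller value from b0: {value:.3f}")
```

Exact policy evaluation via the linear system is what allows each local move to be scored against the true infinite-horizon discounted value, rather than a sampled estimate; the search itself only decides which of the finitely many controller parameters to perturb next.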
