A Scalable Method for Solving High-Dimensional Continuous POMDPs Using Local Approximation

Partially-Observable Markov Decision Processes (POMDPs) are typically solved by finding an approximate global solution to a corresponding belief-MDP. In this paper, we offer a new planning algorithm for POMDPs with continuous state, action, and observation spaces. Since such domains have an inherent notion of locality, we can find an approximate solution using local optimization methods. We parameterize the belief distribution as a Gaussian mixture and use the Extended Kalman Filter (EKF) to approximate the belief update. Since the EKF is a first-order filter, we can marginalize over the observations analytically. By using feedback control and state estimation during policy execution, we recover behavior that is effectively conditioned on incoming observations despite the unconditioned planning. Local optimization provides no guarantees of global optimality, but it allows us to tackle domains that are at least an order of magnitude larger than the current state of the art. We demonstrate the scalability of our algorithm on a simulated hand-eye coordination domain with 16 continuous state dimensions and 6 continuous action dimensions.
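As a concrete illustration of the belief update described above, the sketch below propagates a single Gaussian belief component through one control step with an EKF, and closes the measurement step with the expected (maximum-likelihood) observation so that the innovation is zero and the observation can be marginalized out during planning. This is a minimal sketch under stated assumptions: the dynamics model f, observation model h, noise covariances Q and R, and the finite-difference Jacobians are illustrative placeholders, not the paper's actual models; each component of a Gaussian mixture belief would be updated analogously.

```python
# Minimal sketch of a belief-space EKF update under a maximum-likelihood
# observation assumption. All model names (f, h, Q, R) are illustrative.
import numpy as np

def jacobian(fn, x, eps=1e-6):
    """Finite-difference Jacobian of fn at x."""
    y0 = fn(x)
    J = np.zeros((y0.size, x.size))
    for i in range(x.size):
        dx = np.zeros_like(x)
        dx[i] = eps
        J[:, i] = (fn(x + dx) - y0) / eps
    return J

def ekf_belief_update(mean, cov, u, f, h, Q, R):
    """Propagate a Gaussian belief (mean, cov) through one control step.

    Assuming the planner uses the maximum-likelihood observation, the
    innovation is zero: the mean follows the deterministic dynamics and
    only the covariance is filtered, which is what allows the observation
    to be marginalized analytically during planning.
    """
    # Predict step: push the mean through the dynamics, linearize for the covariance.
    mean_pred = f(mean, u)
    A = jacobian(lambda x: f(x, u), mean)
    cov_pred = A @ cov @ A.T + Q

    # Measurement step with the expected observation (zero innovation).
    C = jacobian(h, mean_pred)
    S = C @ cov_pred @ C.T + R
    K = cov_pred @ C.T @ np.linalg.inv(S)
    cov_new = (np.eye(mean.size) - K @ C) @ cov_pred
    return mean_pred, cov_new

if __name__ == "__main__":
    # Toy 2-D example: single-integrator dynamics, range-only observation.
    f = lambda x, u: x + 0.1 * u
    h = lambda x: np.array([np.linalg.norm(x)])
    mean, cov = np.array([1.0, 0.0]), np.eye(2)
    mean, cov = ekf_belief_update(mean, cov, np.array([0.5, -0.5]),
                                  f, h, Q=0.01 * np.eye(2), R=np.array([[0.1]]))
```

Because the observation never enters the mean update during planning, a local trajectory optimizer can treat the belief dynamics as deterministic; the feedback controller and state estimator then re-introduce observation dependence at execution time, as the abstract notes.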
