Analyzing and Escaping Local Optima in Planning as Inference for Partially Observable Domains

Planning as inference has recently emerged as a versatile approach to decision-theoretic planning and reinforcement learning for single-agent and multi-agent systems in fully and partially observable domains with discrete and continuous variables. Since planning as inference essentially tackles a non-convex optimization problem when the states are partially observable, techniques are needed that can robustly escape local optima. We investigate the local optima of finite state controllers for single-agent partially observable Markov decision processes (POMDPs) optimized by expectation maximization (EM). We show that EM converges to controllers that are optimal with respect to a one-step lookahead. To escape local optima, we propose two algorithms: the first adds nodes to the controller to ensure optimality with respect to a multi-step lookahead, while the second splits nodes in a greedy fashion to improve the reward likelihood. Both approaches are demonstrated empirically on benchmark problems.
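
To make the controller optimization concrete, below is a minimal sketch of EM-style optimization of a stochastic finite state controller on a small randomly generated POMDP. It is an illustration under stated assumptions, not the paper's implementation: the toy model, the problem sizes, and the names psi (action selection), eta (node transitions), and alpha (discounted occupancy) are all assumptions; only the action parameters are updated here, and rewards are kept nonnegative so they can play the role of a likelihood in the planning-as-inference view.

import numpy as np

rng = np.random.default_rng(0)
S, A, O, N = 4, 3, 2, 3   # states, actions, observations, controller nodes (illustrative sizes)
gamma = 0.9

# Random toy POMDP: T[a,s,s'] transitions, Z[a,s',o] observations, nonnegative rewards R[s,a], initial belief b0.
T = rng.random((A, S, S)); T /= T.sum(axis=2, keepdims=True)
Z = rng.random((A, S, O)); Z /= Z.sum(axis=2, keepdims=True)
R = rng.random((S, A))
b0 = np.full(S, 1.0 / S)

# Stochastic finite state controller: psi[n,a] = P(a|n), eta[n,a,o,n'] = P(n'|n,a,o).
psi = rng.random((N, A)); psi /= psi.sum(axis=1, keepdims=True)
eta = rng.random((N, A, O, N)); eta /= eta.sum(axis=3, keepdims=True)

def backup(V):
    # One application of the controller's Bellman recursion to the value table V[n,s].
    Vnew = np.zeros((N, S))
    for n in range(N):
        for s in range(S):
            for a in range(A):
                cont = sum(T[a, s, s2] * Z[a, s2, o] * (eta[n, a, o] @ V[:, s2])
                           for s2 in range(S) for o in range(O))
                Vnew[n, s] += psi[n, a] * (R[s, a] + gamma * cont)
    return Vnew

def propagate(alpha):
    # Push the (node, state) occupancy one step through the controller and the model.
    anew = np.zeros((N, S))
    for n in range(N):
        for s in range(S):
            for a in range(A):
                for s2 in range(S):
                    for o in range(O):
                        anew[:, s2] += alpha[n, s] * psi[n, a] * T[a, s, s2] * Z[a, s2, o] * eta[n, a, o]
    return anew

for it in range(10):
    # E-step: controller value V[n,s] and discounted occupancy alpha[n,s]; node 0 is the start node.
    V = np.zeros((N, S))
    for _ in range(50):
        V = backup(V)
    alpha0 = np.zeros((N, S)); alpha0[0] = b0
    alpha = alpha0.copy()
    for _ in range(50):
        alpha = alpha0 + gamma * propagate(alpha)   # fixed point of the discounted occupancy
    print(f"EM iteration {it}: controller value {b0 @ V[0]:.3f}")

    # M-step for psi: reweight each action by its expected immediate-plus-discounted-future
    # reward under the occupancy, then renormalize (eta would be updated analogously).
    Q = np.zeros((N, A))
    for n in range(N):
        for a in range(A):
            for s in range(S):
                cont = sum(T[a, s, s2] * Z[a, s2, o] * (eta[n, a, o] @ V[:, s2])
                           for s2 in range(S) for o in range(O))
                Q[n, a] += alpha[n, s] * (R[s, a] + gamma * cont)
    psi = psi * Q
    psi /= psi.sum(axis=1, keepdims=True)

Successive iterations of this loop typically increase the controller value, but they can stall at a local optimum of the reward likelihood, which is the situation the paper's node-addition and node-splitting procedures are designed to escape.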
