Sampling Networks and Aggregate Simulation for Online POMDP Planning

The paper introduces a new algorithm for planning in partially observable Markov decision processes (POMDPs) based on the idea of aggregate simulation. The algorithm uses product distributions to approximate the belief state and shows how to build a representation graph of an approximate action-value function over belief space. The graph captures the result of simulating the model in aggregate under independence assumptions, yielding a symbolic representation of the value function. To support large observation spaces, the algorithm uses sampling networks, a representation of the process of sampling observation values, which is integrated into the graph representation. Following previous work on MDPs, this approach enables action selection in POMDPs through gradient optimization over the graph representation. It complements recent POMDP algorithms that rely on particle representations of belief states and explicit search for action selection. Our approach scales to large factored action spaces in addition to large state and observation spaces. An experimental evaluation demonstrates that the algorithm performs well relative to the state of the art on large POMDP problems.

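To make the idea concrete, below is a minimal, hypothetical sketch of the general approach the abstract describes: the belief state is approximated by a product of independent per-variable marginals, the model is simulated "in aggregate" by propagating those marginals through a differentiable transition and reward model, and an action is selected by gradient optimization over relaxed (continuous) action variables. All model components here (step_marginals, expected_reward, the horizon and learning-rate settings) are illustrative placeholders, not the paper's actual algorithm or domains.

```python
import jax
import jax.numpy as jnp

def step_marginals(belief, action):
    """Propagate per-variable state marginals one step under an assumed factored model."""
    # Placeholder dynamics: each marginal mixes its previous value with an
    # action-dependent term; real domains would use the factored transition model.
    return 0.9 * belief + 0.1 * jax.nn.sigmoid(action)

def expected_reward(belief, action):
    """Placeholder additive reward on marginals, with a small action cost."""
    return jnp.sum(belief) - 0.01 * jnp.sum(action ** 2)

def rollout_value(actions, belief0, horizon):
    """Aggregate simulation: accumulate expected reward over the planning horizon."""
    belief, value = belief0, 0.0
    for t in range(horizon):
        value = value + expected_reward(belief, actions[t])
        belief = step_marginals(belief, actions[t])
    return value

def plan(belief0, horizon=5, n_action_vars=4, steps=100, lr=0.1, seed=0):
    """Gradient-based action selection over a relaxed action sequence."""
    key = jax.random.PRNGKey(seed)
    actions = 0.1 * jax.random.normal(key, (horizon, n_action_vars))
    grad_fn = jax.grad(lambda a: -rollout_value(a, belief0, horizon))
    for _ in range(steps):
        actions = actions - lr * grad_fn(actions)
    # Online planning: execute only the first action, then replan after observing.
    return actions[0]

first_action = plan(belief0=jnp.full(4, 0.5))
```

In the paper's setting the rollout would also incorporate sampled observation values via sampling networks and update the product-form belief accordingly; this sketch omits that step and only illustrates the aggregate-simulation-plus-gradient-optimization loop.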