Dual Sequential Monte Carlo: Tunneling Filtering and Planning in Continuous POMDPs

We present the DualSMC network, which solves continuous POMDPs by learning belief representations and then leveraging them for planning. It builds on the observation that filtering (i.e., state estimation) and planning can be viewed as two related sequential Monte Carlo processes, one operating in the belief space and the other in the space of future planning trajectories. In particular, we first introduce a novel particle filter network that exploits the adversarial relationship between the proposer model and the observation model. We then introduce a new planning algorithm over the belief representations, which learns uncertainty-dependent policies. The two parts are trained jointly. We demonstrate the effectiveness of our approach on three continuous control and planning tasks: floor positioning, 3D light-dark navigation, and a modified Reacher task.
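To make the high-level description above concrete, the following minimal sketch illustrates the two coupled sequential Monte Carlo processes: a particle filter that updates a belief over states, and an SMC-style planner that samples and scores candidate future trajectories starting from belief samples. This is an illustrative sketch under toy assumptions, not the paper's method: the `transition`, `observation_likelihood`, `proposer`, `policy`, and `reward` functions are hypothetical stand-ins for the learned networks described in the abstract.

```python
# Minimal sketch of the dual SMC idea: filtering SMC over states plus
# planning SMC over future trajectories. All models are toy stand-ins.
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM = 2
N_BELIEF, N_PLANS, HORIZON = 100, 20, 5
GOAL = np.array([1.0, 1.0])

def transition(state, action):
    """Toy stochastic dynamics: move by the action plus Gaussian noise."""
    return state + action + 0.05 * rng.standard_normal(state.shape)

def observation_likelihood(obs, states):
    """Toy observation model: likelihood of a noisy position reading."""
    d2 = np.sum((states - obs) ** 2, axis=-1)
    return np.exp(-0.5 * d2 / 0.1 ** 2) + 1e-12

def proposer(obs, n):
    """Stand-in for the adversarial proposer: sample states near the observation."""
    return obs + 0.1 * rng.standard_normal((n, STATE_DIM))

def policy(states):
    """Stand-in policy: head toward a fixed goal."""
    direction = GOAL - states
    return 0.1 * direction / (np.linalg.norm(direction, axis=-1, keepdims=True) + 1e-8)

def reward(states):
    """Toy reward: negative distance to the goal."""
    return -np.linalg.norm(states - GOAL, axis=-1)

# --- Filtering SMC: update the particle belief with a new observation ---
def filter_step(particles, weights, action, obs, mix=0.2):
    particles = transition(particles, action)
    # Inject a fraction of proposed samples so the belief can jump toward
    # states supported by the current observation.
    n_prop = int(mix * len(particles))
    particles[:n_prop] = proposer(obs, n_prop)
    weights = weights * observation_likelihood(obs, particles)
    weights /= weights.sum()
    # Resample when the effective sample size collapses.
    if 1.0 / np.sum(weights ** 2) < len(particles) / 2:
        idx = rng.choice(len(particles), len(particles), p=weights)
        particles = particles[idx]
        weights = np.full(len(particles), 1.0 / len(particles))
    return particles, weights

# --- Planning SMC: roll out and score trajectories that start from the belief ---
def plan(particles, weights):
    idx = rng.choice(len(particles), N_PLANS, p=weights)
    states = particles[idx]                  # each plan starts from a belief sample
    first_actions = policy(states)
    returns = np.zeros(N_PLANS)
    actions = first_actions
    for _ in range(HORIZON):
        states = transition(states, actions)
        returns += reward(states)
        actions = policy(states)
    # Weight plans by their return and execute the first action of the best one.
    return first_actions[np.argmax(returns)]

# One short interaction loop with a toy environment.
belief = rng.standard_normal((N_BELIEF, STATE_DIM))
weights = np.full(N_BELIEF, 1.0 / N_BELIEF)
true_state = np.zeros(STATE_DIM)
for t in range(10):
    action = plan(belief, weights)
    true_state = transition(true_state, action)
    obs = true_state + 0.1 * rng.standard_normal(STATE_DIM)
    belief, weights = filter_step(belief, weights, action, obs)
```

In this sketch the two processes share the particle belief: the planner draws its rollout starting states from the filtered particles, which is one simple way to make the planning distribution depend on state uncertainty.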
