Visual Learning-based Planning for Continuous High-Dimensional POMDPs

The Partially Observable Markov Decision Process (POMDP) is a powerful framework for capturing decision-making problems that involve state and transition uncertainty. However, most current POMDP planners cannot effectively handle the very high-dimensional observations they often encounter in the real world (e.g., image observations in robotic domains). In this work, we propose Visual Tree Search (VTS), a learning and planning procedure that combines generative models learned offline with online model-based POMDP planning. VTS bridges offline model training and online planning by using a set of deep generative observation models to predict and evaluate the likelihood of image observations within a Monte Carlo tree search planner. We show that VTS is robust to different observation noises and, because it uses online model-based planning, can adapt to different reward structures without retraining. This approach outperforms a baseline state-of-the-art on-policy planning algorithm while requiring significantly less offline training time.
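
To make the planner-model interface concrete, below is a minimal sketch (not the authors' released code) of how offline-learned generative observation models can drive a particle-based tree-search rollout: one learned model samples an observation conditioned on a state, and another scores its likelihood, which in turn reweights the particle belief during search. Everything here is illustrative: the toy Gaussian dynamics and the names `sample_observation`, `observation_likelihood`, and `rollout_value` are assumptions standing in for trained conditional generators/density models and a full POMCPOW- or DESPOT-style solver.

```python
import numpy as np

rng = np.random.default_rng(0)

def transition(state, action):
    # Toy stochastic dynamics standing in for the true POMDP transition model.
    return state + action + rng.normal(0.0, 0.1, size=state.shape)

def sample_observation(state):
    # Stands in for a learned conditional generator sampling o ~ p(o | s)
    # (e.g., a conditional VAE/GAN that produces image observations).
    return state + rng.normal(0.0, 0.3, size=state.shape)

def observation_likelihood(obs, state):
    # Stands in for a learned model evaluating the density p(o | s);
    # here it is just a Gaussian consistent with sample_observation above.
    diff = obs - state
    return float(np.exp(-0.5 * (diff @ diff) / 0.3**2))

def reweight(particles, weights, obs):
    # Particle-filter belief update driven by the learned likelihood.
    w = weights * np.array([observation_likelihood(obs, s) for s in particles])
    total = w.sum()
    if total == 0.0:  # degenerate case: no particle explains the observation
        return np.full_like(weights, 1.0 / len(weights))
    return w / total

def rollout_value(particles, weights, actions, depth, gamma=0.95):
    # One random playout of the kind a tree-search planner runs many times:
    # sample a state from the belief, step it forward, predict an observation
    # with the generator, then fold that observation back into the belief.
    if depth == 0:
        return 0.0
    a = actions[rng.integers(len(actions))]
    s = particles[rng.choice(len(particles), p=weights)]
    s_next = transition(s, a)
    obs = sample_observation(s_next)                       # predict an observation
    next_particles = np.array([transition(p, a) for p in particles])
    next_weights = reweight(next_particles, weights, obs)  # evaluate likelihoods
    reward = -float(s_next @ s_next)                       # toy cost-to-origin reward
    return reward + gamma * rollout_value(next_particles, next_weights,
                                          actions, depth - 1, gamma)

# Example usage with a 1-D state and a 100-particle belief.
actions = [np.array([-0.5]), np.array([0.0]), np.array([0.5])]
belief = rng.normal(0.0, 1.0, size=(100, 1))
weights = np.full(100, 1.0 / 100)
print(rollout_value(belief, weights, actions, depth=5))
```

The key interface point is that the planner never needs a hand-specified observation density: wherever the search requires sampling or evaluating p(o | s), the offline-learned networks are queried instead. Since the reward function enters only through the online search, this is also what would let the same learned models be reused under different reward structures without retraining.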
