Visual Learning-based Planning for Continuous High-Dimensional POMDPs

The Partially Observable Markov Decision Process (POMDP) is a powerful framework for capturing decision-making problems that involve state and transition uncertainty. However, most current POMDP planners cannot effectively handle the very high-dimensional observations they often encounter in the real world (e.g., image observations in robotic domains). In this work, we propose Visual Tree Search (VTS), a learning and planning procedure that combines generative models learned offline with online model-based POMDP planning. VTS bridges offline model training and online planning by using a set of deep generative observation models to predict and evaluate the likelihood of image observations within a Monte Carlo tree search planner. We show that VTS is robust to different observation noises and, because it uses online model-based planning, can adapt to different reward structures without retraining. This approach outperforms a baseline state-of-the-art on-policy planning algorithm while requiring significantly less offline training time.
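
To make the planner-model interface concrete, below is a minimal sketch (not the authors' released code) of how offline-learned generative observation models can drive a particle-based tree-search rollout: one learned model samples an observation conditioned on a state, and another scores its likelihood, which in turn reweights the particle belief during search. Everything here is illustrative: the toy Gaussian dynamics and the names `sample_observation`, `observation_likelihood`, and `rollout_value` are assumptions standing in for trained conditional generators/density models and a full POMCPOW- or DESPOT-style solver.

```python
import numpy as np

rng = np.random.default_rng(0)

def transition(state, action):
    # Toy stochastic dynamics standing in for the true POMDP transition model.
    return state + action + rng.normal(0.0, 0.1, size=state.shape)

def sample_observation(state):
    # Stands in for a learned conditional generator sampling o ~ p(o | s)
    # (e.g., a conditional VAE/GAN that produces image observations).
    return state + rng.normal(0.0, 0.3, size=state.shape)

def observation_likelihood(obs, state):
    # Stands in for a learned model evaluating the density p(o | s);
    # here it is just a Gaussian consistent with sample_observation above.
    diff = obs - state
    return float(np.exp(-0.5 * (diff @ diff) / 0.3**2))

def reweight(particles, weights, obs):
    # Particle-filter belief update driven by the learned likelihood.
    w = weights * np.array([observation_likelihood(obs, s) for s in particles])
    total = w.sum()
    if total == 0.0:  # degenerate case: no particle explains the observation
        return np.full_like(weights, 1.0 / len(weights))
    return w / total

def rollout_value(particles, weights, actions, depth, gamma=0.95):
    # One random playout of the kind a tree-search planner runs many times:
    # sample a state from the belief, step it forward, predict an observation
    # with the generator, then fold that observation back into the belief.
    if depth == 0:
        return 0.0
    a = actions[rng.integers(len(actions))]
    s = particles[rng.choice(len(particles), p=weights)]
    s_next = transition(s, a)
    obs = sample_observation(s_next)                       # predict an observation
    next_particles = np.array([transition(p, a) for p in particles])
    next_weights = reweight(next_particles, weights, obs)  # evaluate likelihoods
    reward = -float(s_next @ s_next)                       # toy cost-to-origin reward
    return reward + gamma * rollout_value(next_particles, next_weights,
                                          actions, depth - 1, gamma)

# Example usage with a 1-D state and a 100-particle belief.
actions = [np.array([-0.5]), np.array([0.0]), np.array([0.5])]
belief = rng.normal(0.0, 1.0, size=(100, 1))
weights = np.full(100, 1.0 / 100)
print(rollout_value(belief, weights, actions, depth=5))
```

The key interface point is that the planner never needs a hand-specified observation density: wherever the search requires sampling or evaluating p(o | s), the offline-learned networks are queried instead. Since the reward function enters only through the online search, this is also what would let the same learned models be reused under different reward structures without retraining.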
