Variational Inference for Data-Efficient Model Learning in POMDPs

Partially observable Markov decision processes (POMDPs) are a powerful abstraction for tasks that require decision-making under uncertainty, and they capture a wide range of real-world problems. Effective planning approaches exist that generate strong control strategies given a black-box model of a POMDP task, yet how to acquire accurate models of complex domains remains an open question. In this paper we propose DELIP, an approach to model learning for POMDPs based on amortized structured variational inference. We show empirically that the learned model yields effective control strategies when coupled with state-of-the-art planners. Intuitively, model-based approaches should be particularly beneficial in environments with changing reward structures, or where rewards are initially unknown; our experiments confirm that DELIP is particularly effective in this setting.
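To make the idea of amortized structured variational inference for POMDP model learning concrete, the following is a minimal sketch (not the authors' implementation): a sequential latent-variable model with a learned transition prior p(z_t | z_{t-1}, a_{t-1}), observation and reward decoders, and an amortized inference network q(z_t | z_{t-1}, a_{t-1}, o_t), trained by maximizing the ELBO with the reparameterization trick. All names and dimensions (DELIPSketch, obs_dim, act_dim, latent_dim, etc.) are illustrative assumptions.

```python
# Hedged sketch of DELIP-style model learning; module names and sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DELIPSketch(nn.Module):
    def __init__(self, obs_dim=16, act_dim=4, latent_dim=8, hidden=64):
        super().__init__()
        self.latent_dim = latent_dim
        # Generative model: transition prior p(z_t | z_{t-1}, a_{t-1}) and decoders.
        self.trans = nn.Sequential(nn.Linear(latent_dim + act_dim, hidden),
                                   nn.ReLU(), nn.Linear(hidden, 2 * latent_dim))
        self.obs_dec = nn.Linear(latent_dim, obs_dim)  # mean of p(o_t | z_t)
        self.rew_dec = nn.Linear(latent_dim, 1)        # mean of p(r_t | z_t)
        # Amortized structured inference network q(z_t | z_{t-1}, a_{t-1}, o_t).
        self.enc = nn.Sequential(nn.Linear(latent_dim + act_dim + obs_dim, hidden),
                                 nn.ReLU(), nn.Linear(hidden, 2 * latent_dim))

    @staticmethod
    def _gaussian(params):
        mu, log_var = params.chunk(2, dim=-1)
        return mu, log_var

    def elbo(self, obs, act, rew):
        """One trajectory: obs (T, obs_dim), act (T, act_dim), rew (T, 1)."""
        z_prev = torch.zeros(self.latent_dim)
        total = 0.0
        for t in range(obs.size(0)):
            # Posterior q(z_t | z_{t-1}, a_{t-1}, o_t) from the inference network.
            q_mu, q_logvar = self._gaussian(self.enc(torch.cat([z_prev, act[t], obs[t]])))
            z = q_mu + torch.randn_like(q_mu) * (0.5 * q_logvar).exp()  # reparameterize
            # Prior p(z_t | z_{t-1}, a_{t-1}) from the learned transition model.
            p_mu, p_logvar = self._gaussian(self.trans(torch.cat([z_prev, act[t]])))
            # Reconstruction terms: Gaussian log-likelihoods up to constants and scale.
            recon = (-F.mse_loss(self.obs_dec(z), obs[t], reduction='sum')
                     - F.mse_loss(self.rew_dec(z), rew[t], reduction='sum'))
            # Closed-form KL(q || p) between diagonal Gaussians.
            kl = 0.5 * (p_logvar - q_logvar
                        + (q_logvar.exp() + (q_mu - p_mu) ** 2) / p_logvar.exp()
                        - 1.0).sum()
            total = total + recon - kl
            z_prev = z
        return total


if __name__ == "__main__":
    model = DELIPSketch()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    # Dummy trajectory standing in for data collected from the environment.
    obs, act, rew = torch.randn(10, 16), torch.randn(10, 4), torch.randn(10, 1)
    loss = -model.elbo(obs, act, rew)  # maximize the ELBO
    loss.backward()
    opt.step()
```

Once trained, such a model can serve as a black-box simulator for a planner: rolling the transition prior and reward decoder forward from an inferred belief over z generates candidate futures for the planner to evaluate.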
