VariBAD: A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning

Trading off exploration and exploitation in an unknown environment is key to maximising expected return during learning. A Bayes-optimal policy, which does so optimally, conditions its actions not only on the environment state but also on the agent's uncertainty about the environment. However, computing a Bayes-optimal policy is intractable for all but the smallest tasks. In this paper, we introduce variational Bayes-Adaptive Deep RL (variBAD), a method that meta-learns to perform approximate inference in an unknown environment and to incorporate task uncertainty directly into action selection. In a grid-world domain, we illustrate how variBAD performs structured online exploration as a function of task uncertainty. We further evaluate variBAD on MuJoCo domains widely used in meta-RL and show that it achieves higher online return than existing methods.
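
To make the setup concrete, below is a minimal sketch, assuming PyTorch, of the belief-conditioned architecture the abstract describes: an RNN encoder maps the trajectory so far to a Gaussian posterior over a latent task variable (the approximate belief), a decoder trains that posterior with a VAE-style ELBO, and the policy conditions on both the state and the belief parameters. All module names, dimensions, the reward-only decoder, the KL weight, and the toy data are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of the variBAD idea (illustrative, not the paper's code).
import torch
import torch.nn as nn

class BeliefEncoder(nn.Module):
    """RNN mapping the trajectory so far to a Gaussian posterior
    over a latent task variable m -- the approximate task belief."""
    def __init__(self, obs_dim, act_dim, latent_dim, hidden_dim=64):
        super().__init__()
        self.gru = nn.GRU(obs_dim + act_dim + 1, hidden_dim, batch_first=True)
        self.mu = nn.Linear(hidden_dim, latent_dim)
        self.logvar = nn.Linear(hidden_dim, latent_dim)

    def forward(self, obs, act, rew):
        # inputs: (batch, time, dim); reward is the extra scalar channel
        h, _ = self.gru(torch.cat([obs, act, rew], dim=-1))
        return self.mu(h), self.logvar(h)  # posterior at every timestep

class BeliefConditionedPolicy(nn.Module):
    """Policy conditioned on the state AND the belief (mu, logvar),
    so action selection can depend on task uncertainty."""
    def __init__(self, obs_dim, latent_dim, act_dim, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + 2 * latent_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, act_dim))

    def forward(self, obs, mu, logvar):
        return self.net(torch.cat([obs, mu, logvar], dim=-1))

class RewardDecoder(nn.Module):
    """Decoder p(r | s, a, m) for the ELBO reconstruction term."""
    def __init__(self, obs_dim, act_dim, latent_dim, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim + latent_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, 1))

    def forward(self, obs, act, m):
        return self.net(torch.cat([obs, act, m], dim=-1))

def elbo_loss(mu, logvar, rew_pred, rew_true):
    # reconstruction + KL(q(m|tau) || N(0, I)); the policy itself is
    # trained separately with any RL objective on the belief-augmented state
    recon = ((rew_pred - rew_true) ** 2).mean()
    kl = (-0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1)).mean()
    return recon + 0.1 * kl  # 0.1 is an illustrative KL weight

# Toy shapes: 8 trajectories, 5 steps, 4-dim observations, 2-dim actions.
obs = torch.randn(8, 5, 4); act = torch.randn(8, 5, 2); rew = torch.randn(8, 5, 1)
enc, dec = BeliefEncoder(4, 2, 3), RewardDecoder(4, 2, 3)
pi = BeliefConditionedPolicy(4, 3, 2)
mu, logvar = enc(obs, act, rew)
m = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterisation trick
loss = elbo_loss(mu, logvar, dec(obs, act, m), rew)
action = pi(obs[:, -1], mu[:, -1], logvar[:, -1])       # act on the current belief
loss.backward()
```

Feeding the policy the posterior parameters (mu, logvar) rather than a single sampled latent is what allows exploration to depend on how uncertain the agent still is about the task, which is the structured, uncertainty-driven exploration the abstract refers to.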
