Paper information - VariBAD: A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning


Trading off exploration and exploitation in an unknown environment is key to maximising expected return during learning. A Bayes-optimal policy, which does so optimally, conditions its actions not only on the environment state but also on the agent's uncertainty about the environment. Computing a Bayes-optimal policy is, however, intractable for all but the smallest tasks. In this paper, we introduce variational Bayes-Adaptive Deep RL (variBAD), a way to meta-learn to perform approximate inference in an unknown environment, and incorporate task uncertainty directly during action selection. In a grid-world domain, we illustrate how variBAD performs structured online exploration as a function of task uncertainty. We further evaluate variBAD on MuJoCo domains widely used in meta-RL and show that it achieves higher online return than existing methods.
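
The idea of conditioning actions on both the environment state and the agent's uncertainty can be illustrated with a toy example. The sketch below is not variBAD itself, which meta-learns an approximate posterior. It is a hand-coded conjugate Beta-Bernoulli belief over a single unknown reward probability, and the names `BetaBelief` and `augmented_state` are hypothetical, introduced only for illustration. It shows how a belief can be updated online and appended to the environment state, so that a policy could take uncertainty as an input.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BetaBelief:
    """Beta(alpha, beta) posterior over an unknown Bernoulli reward probability.

    Illustrative stand-in for a task belief; variBAD learns an approximate
    posterior instead of using a closed-form conjugate update like this one.
    """
    alpha: float = 1.0  # prior pseudo-count of rewards equal to 1
    beta: float = 1.0   # prior pseudo-count of rewards equal to 0

    def update(self, reward: int) -> "BetaBelief":
        """Return the posterior after observing one binary reward."""
        if reward not in (0, 1):
            raise ValueError("reward must be 0 or 1")
        return BetaBelief(self.alpha + reward, self.beta + (1 - reward))

    @property
    def mean(self) -> float:
        return self.alpha / (self.alpha + self.beta)

    @property
    def variance(self) -> float:
        n = self.alpha + self.beta
        return (self.alpha * self.beta) / (n * n * (n + 1.0))


def augmented_state(env_state: tuple, belief: BetaBelief) -> tuple:
    """Concatenate the environment state with the belief parameters.

    A policy that takes this tuple as input can, in principle, act
    differently when it is uncertain than when it is confident.
    """
    return env_state + (belief.alpha, belief.beta)


if __name__ == "__main__":
    belief = BetaBelief()
    for reward in [1, 1, 0, 1]:
        belief = belief.update(reward)
    print(augmented_state((0, 0), belief))
    print(belief.mean, belief.variance)
```

The variance of the belief shrinks as observations accumulate, which is the kind of signal an uncertainty-aware policy could use to decide when further exploration is no longer worthwhile.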
