Model-based Adversarial Meta-Reinforcement Learning

Meta-reinforcement learning (meta-RL) aims to learn, from multiple training tasks, the ability to adapt efficiently to unseen test tasks. Despite their success, existing meta-RL algorithms are known to be sensitive to task distribution shift: when the test task distribution differs from the training task distribution, performance may degrade significantly. To address this issue, this paper proposes Model-based Adversarial Meta-Reinforcement Learning (AdMRL), a model-based approach that aims to minimize the worst-case suboptimality gap -- the difference between the optimal return and the return the algorithm achieves after adaptation -- across all tasks in a family of tasks. We propose a minimax objective and optimize it by alternating between learning the dynamics model on a fixed task and finding the adversarial task for the current model -- the task for which the policy induced by the model is maximally suboptimal. Assuming the family of tasks is parameterized, we derive a formula for the gradient of the suboptimality with respect to the task parameters via the implicit function theorem, and show how the gradient estimator can be implemented efficiently with the conjugate gradient method and a novel use of the REINFORCE estimator. We evaluate our approach on several continuous control benchmarks and demonstrate that it outperforms existing state-of-the-art meta-RL algorithms in worst-case performance over all tasks, in generalization to out-of-distribution tasks, and in both training-time and test-time sample efficiency.
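
As a rough formalization of the objective described in the abstract (a minimal sketch in illustrative notation; the symbols $\psi$, $\widehat{T}$, $T^\star$, and $r_\psi$ are our own shorthand rather than the paper's exact notation), the minimax problem can be written as

$$\min_{\widehat{T}} \; \max_{\psi} \; \Big[ \max_{\pi} V^{\pi}_{T^\star,\, r_\psi} \;-\; V^{\pi_{\widehat{T},\psi}}_{T^\star,\, r_\psi} \Big],$$

where $T^\star$ is the true dynamics, $r_\psi$ is the reward of the task with parameters $\psi$, $V^{\pi}_{T^\star,\, r_\psi}$ is the return of policy $\pi$ under the true dynamics on that task, and $\pi_{\widehat{T},\psi}$ is the policy obtained by adapting with the learned model $\widehat{T}$ on task $\psi$. The inner maximization over $\psi$ is the adversarial task search, and the gradient of the bracketed suboptimality with respect to $\psi$ is the quantity the abstract describes computing via the implicit function theorem, the conjugate gradient method, and the REINFORCE estimator.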
