Improving Generalization in Meta Reinforcement Learning using Learned Objectives

Biological evolution has distilled the experiences of many learners into the general learning algorithms of humans. Our novel meta-reinforcement learning algorithm MetaGenRL is inspired by this process. MetaGenRL distills the experiences of many complex agents to meta-learn a low-complexity neural objective function that decides how future individuals will learn. Unlike recent meta-RL algorithms, MetaGenRL can generalize to new environments that are entirely different from those used for meta-training, and in some cases it even outperforms human-engineered RL algorithms. During meta-training, MetaGenRL uses off-policy second-order gradients, which greatly increases its sample efficiency.
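To make the mechanism concrete, below is a minimal sketch (in JAX, not the authors' code) of the second-order meta-gradient at the heart of this approach: the policy takes one gradient step on a learned neural objective, the updated policy is scored by a critic, and differentiating that score with respect to the objective's parameters yields the meta-gradient. All names, architectures, and shapes here (`policy`, `learned_objective`, `critic_value`, the toy MLP) are illustrative assumptions; the paper's actual objective function conditions on whole trajectories.

```python
import jax
import jax.numpy as jnp

def policy(theta, obs):
    # Tiny deterministic policy: a single tanh-squashed linear layer (assumption).
    return jnp.tanh(obs @ theta)

def learned_objective(phi, theta, obs, rewards):
    # Neural objective L_phi: maps per-step rewards and policy outputs to a
    # scalar loss that the policy is trained on (illustrative two-layer MLP;
    # the paper uses a recurrent network over the trajectory).
    acts = policy(theta, obs)
    feats = jnp.concatenate([acts, rewards[:, None]], axis=-1)
    h = jnp.tanh(feats @ phi["w1"] + phi["b1"])
    return jnp.mean(h @ phi["w2"] + phi["b2"])

def critic_value(theta, obs):
    # Placeholder for a learned off-policy critic Q(s, pi_theta(s)); a fixed
    # quadratic here so the example runs end to end (assumption).
    acts = policy(theta, obs)
    return jnp.mean(-(acts - 0.5) ** 2)

def meta_loss(phi, theta, obs, rewards, lr=1e-2):
    # Inner step: one gradient update of the policy on the *learned* objective.
    g = jax.grad(learned_objective, argnums=1)(phi, theta, obs, rewards)
    theta_new = theta - lr * g
    # Outer step: score the updated policy with the critic. Differentiating
    # this score w.r.t. phi (through theta_new) is a second-order gradient.
    return -critic_value(theta_new, obs)

# d(meta_loss)/d(phi), backpropagated through the inner policy update.
meta_grad_fn = jax.grad(meta_loss)

key = jax.random.PRNGKey(0)
obs = jax.random.normal(key, (8, 4))     # 8 timesteps, 4-dim observations
rewards = jnp.ones((8,))
theta = jnp.zeros((4, 2))                # obs_dim x act_dim policy weights
phi = {"w1": 0.1 * jnp.ones((3, 16)), "b1": jnp.zeros(16),
       "w2": 0.1 * jnp.ones((16, 1)), "b2": jnp.zeros(1)}
meta_grads = meta_grad_fn(phi, theta, obs, rewards)  # pytree matching phi
```

In the paper, the critic's role is played by a value function trained off-policy, which is what makes the meta-gradient sample efficient; the toy quadratic above merely keeps the sketch self-contained and runnable.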
