Decoupling Exploration and Exploitation for Meta-Reinforcement Learning without Sacrifices

The goal of meta-reinforcement learning (meta-RL) is to build agents that can quickly learn new tasks by leveraging prior experience on related tasks. Learning a new task often requires both exploring to gather task-relevant information and exploiting this information to solve the task. In principle, optimal exploration and exploitation can be learned end-to-end by simply maximizing task performance. However, such meta-RL approaches struggle with local optima due to a chicken-and-egg problem: learning to explore requires good exploitation to gauge the exploration's utility, but learning to exploit requires information gathered via exploration. Optimizing separate objectives for exploration and exploitation can avoid this problem, but prior meta-RL exploration objectives yield suboptimal policies that gather information irrelevant to the task. We alleviate both concerns by constructing an exploitation objective that automatically identifies task-relevant information and an exploration objective that recovers only this information. This avoids local optima in end-to-end training without sacrificing optimal exploration. Empirically, our method, DREAM, substantially outperforms existing approaches on complex meta-RL problems, such as sparse-reward 3D visual navigation. Videos of DREAM: https://ezliu.github.io/dream/
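To make the decoupling in the abstract concrete, here is a minimal sketch, not the authors' released code. The assumptions are mine: a task-ID embedding stands in for the exploitation-time task encoder, a simple L2 penalty stands in for the information bottleneck that strips task-irrelevant bits from the latent, and a GRU trajectory decoder stands in for the model that rewards exploration trajectories from which that latent can be recovered. Names such as task_encoder, traj_decoder, exploitation_objective, and exploration_reward are illustrative, not from the paper.

# Hedged sketch of decoupled exploitation/exploration objectives (illustrative only).
import torch
import torch.nn as nn

latent_dim, obs_dim, num_tasks = 8, 16, 10

# F_psi(task id) -> task latent z (assumed task-ID-conditioned encoder).
task_encoder = nn.Embedding(num_tasks, latent_dim)
# q_omega(exploration trajectory) -> reconstruction of z (assumed decoder).
traj_decoder = nn.GRU(obs_dim, latent_dim, batch_first=True)

def exploitation_objective(task_ids, rl_loss, beta=1e-3):
    """Standard RL loss for the exploitation policy conditioned on z, plus a
    bottleneck penalty so z keeps only task-relevant information (L2 here is
    a stand-in for a proper information bottleneck)."""
    z = task_encoder(task_ids)
    bottleneck = beta * z.pow(2).sum(dim=-1).mean()
    return rl_loss + bottleneck, z

def exploration_reward(z, exploration_traj):
    """Reward exploration trajectories from which the task latent z can be
    decoded, i.e. trajectories that recover exactly the information the
    exploitation objective identified as relevant."""
    _, z_hat = traj_decoder(exploration_traj)          # z_hat: (1, batch, latent_dim)
    recon_error = (z_hat.squeeze(0) - z.detach()).pow(2).sum(dim=-1)
    return -recon_error                                # higher when the trajectory predicts z well

# Illustrative usage with placeholder data.
task_ids = torch.randint(0, num_tasks, (4,))
fake_rl_loss = torch.tensor(0.0)                       # stands in for an actual RL loss
exploit_loss, z = exploitation_objective(task_ids, fake_rl_loss)
fake_traj = torch.randn(4, 5, obs_dim)                 # batch of 5-step exploration trajectories
print(exploration_reward(z, fake_traj))

The point of the sketch is the structure, not the particular losses: the exploration reward depends only on the latent that the exploitation objective learned to keep, so each policy has its own objective and neither has to wait for the other to be fully trained, which is how the chicken-and-egg problem described above is sidestepped.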
