Explore then Execute: Adapting without Rewards via Factorized Meta-Reinforcement Learning

We seek to learn efficiently by leveraging shared structure between different tasks and environments. For example, cooking is similar in different kitchens, even though the ingredients may change location. In principle, meta-reinforcement learning approaches can exploit this shared structure, but in practice they fail to adapt to new environments when adaptation requires targeted exploration (e.g., exploring the cabinets to find ingredients in a new kitchen). We show that existing approaches fail due to a chicken-and-egg problem: learning what to explore requires knowing what information is critical for solving the task, but learning to solve the task requires already gathering this information via exploration. For example, exploring to find the ingredients only helps a robot prepare a meal if it already knows how to cook, but the robot can only learn to cook if it already knows where the ingredients are. To address this, we propose a new exploration objective (DREAM), based on identifying key information in the environment, independent of how this information will be used to solve the task. By decoupling exploration from task execution, DREAM explores and consequently adapts to new environments, requiring no reward signal when the task is specified via an instruction. Empirically, DREAM scales to more complex problems, such as sparse-reward 3D visual navigation, while existing approaches fail due to insufficient exploration.
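The decoupling idea can be made concrete with a small sketch. The PyTorch snippet below is a minimal illustration under stated assumptions, not the paper's implementation: the component names (ProblemEmbedder, TrajectoryEncoder, exploration_reward) and the toy tensor shapes are hypothetical, and a squared-error surrogate stands in for the exact information-based objective. The point is the factorization: the execution policy consumes a latent summary of the environment, while the exploration policy is rewarded only for producing trajectories whose encoding recovers that summary, with no task reward.

```python
# Minimal sketch of decoupled exploration and execution (assumptions noted above).
import torch
import torch.nn as nn


class ProblemEmbedder(nn.Module):
    """Maps a problem identifier (e.g., which kitchen) to a latent summary of the
    information needed to execute tasks in that environment (hypothetical component)."""

    def __init__(self, num_problems: int, dim: int):
        super().__init__()
        self.embed = nn.Embedding(num_problems, dim)

    def forward(self, problem_id: torch.Tensor) -> torch.Tensor:
        return self.embed(problem_id)


class TrajectoryEncoder(nn.Module):
    """Encodes an exploration trajectory into a latent vector z (hypothetical component)."""

    def __init__(self, obs_dim: int, dim: int):
        super().__init__()
        self.rnn = nn.GRU(obs_dim, dim, batch_first=True)

    def forward(self, traj: torch.Tensor) -> torch.Tensor:
        # traj: (batch, T, obs_dim) -> (batch, dim)
        _, hidden = self.rnn(traj)
        return hidden[-1]


def exploration_reward(z_traj: torch.Tensor, z_problem: torch.Tensor) -> torch.Tensor:
    """Reward exploration for recovering the problem summary: a crude squared-error
    stand-in for the information-based objective; uses no task reward."""
    return -((z_traj - z_problem.detach()) ** 2).sum(dim=-1)


# Sketch of the training loop structure:
#   1. Train the problem embedding jointly with the execution policy, so it holds
#      exactly the information needed to carry out the instruction.
#   2. Train the exploration policy with `exploration_reward`, i.e., to produce
#      trajectories whose encoding matches the problem embedding.
#   3. At meta-test time, run one exploration episode, encode it, and condition
#      the execution policy on the resulting latent.

if __name__ == "__main__":
    enc = TrajectoryEncoder(obs_dim=8, dim=16)
    emb = ProblemEmbedder(num_problems=10, dim=16)
    rollouts = torch.randn(4, 20, 8)          # 4 toy exploration rollouts of length 20
    reward = exploration_reward(enc(rollouts), emb(torch.arange(4)))
    print(reward.shape)                       # torch.Size([4])
```

Because the exploration reward depends only on recovering the environment's key information, the exploration policy never needs the task reward, which is what allows adaptation without rewards at test time.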
