Beyond Fine-Tuning: Transferring Behavior in Reinforcement Learning

Designing agents that acquire knowledge autonomously and use it to solve new tasks efficiently is an important challenge in reinforcement learning. Knowledge acquired during an unsupervised pre-training phase is often transferred by fine-tuning neural network weights once rewards are exposed, as is common practice in supervised domains. Given the nature of the reinforcement learning problem, we argue that standard fine-tuning strategies alone are not enough for efficient transfer in challenging domains. We introduce Behavior Transfer (BT), a technique that leverages pre-trained policies for exploration and that is complementary to transferring neural network weights. Our experiments show that, when combined with large-scale pre-training in the absence of rewards, existing intrinsic motivation objectives can lead to the emergence of complex behaviors. These pre-trained policies can then be leveraged by BT to discover better solutions than learning from scratch, and combining BT with standard fine-tuning strategies yields additional benefits. The largest gains are generally observed in domains requiring structured exploration, including settings where the behavior of the pre-trained policies is misaligned with the downstream task.
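
The abstract does not spell out how BT consults the pre-trained policy at decision time. One plausible reading, sketched below as a minimal, non-authoritative illustration, is to replace undirected exploratory noise with temporally extended roll-outs of the pre-trained policy; the class name, the epsilon and commit_steps parameters, and the policy callables are all hypothetical and not taken from the paper.

    import random

    class BehaviorTransferExplorer:
        """Hypothetical sketch: exploratory actions are drawn from a reward-free
        pre-trained policy instead of uniform noise. Names and hyper-parameters
        are illustrative assumptions, not the paper's specification."""

        def __init__(self, task_policy, pretrained_policy, epsilon=0.1, commit_steps=10):
            self.task_policy = task_policy              # policy being learned on the downstream task
            self.pretrained_policy = pretrained_policy  # policy from unsupervised pre-training
            self.epsilon = epsilon                      # probability of starting an exploratory segment
            self.commit_steps = commit_steps            # follow the pre-trained policy for several steps
            self._remaining = 0

        def act(self, observation):
            # Continue an ongoing exploratory segment driven by the pre-trained policy.
            if self._remaining > 0:
                self._remaining -= 1
                return self.pretrained_policy(observation)
            # Occasionally start a new exploratory segment; otherwise exploit the task policy.
            if random.random() < self.epsilon:
                self._remaining = self.commit_steps - 1
                return self.pretrained_policy(observation)
            return self.task_policy(observation)

Under these assumptions, the pre-trained behavior shapes exploration without touching the task policy's weights, which is what makes the mechanism complementary to standard fine-tuning rather than a replacement for it.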
