Reincarnating Reinforcement Learning: Reusing Prior Computation to Accelerate Progress

Learning tabula rasa, that is, without any prior knowledge, is the prevalent workflow in reinforcement learning (RL) research. However, RL systems, when applied to large-scale settings, rarely operate tabula rasa. Such large-scale systems undergo multiple design or algorithmic changes during their development cycle and use ad hoc approaches for incorporating these changes without re-training from scratch, which would be prohibitively expensive. Additionally, the sample inefficiency of deep RL typically excludes researchers without access to industrial-scale resources from tackling computationally demanding problems. To address these issues, we present reincarnating RL as an alternative workflow, where prior computational work (e.g., learned policies) is reused or transferred between design iterations of an RL agent, or from one RL agent to another. As a step towards enabling reincarnating RL from any agent to any other agent, we focus on the specific setting of efficiently transferring an existing sub-optimal policy to a standalone value-based RL agent. We find that existing approaches fail in this setting and propose a simple algorithm to address their limitations. Equipped with this algorithm, we demonstrate reincarnating RL's gains over tabula rasa RL on Atari 2600 games, a challenging locomotion task, and the real-world problem of navigating stratospheric balloons. Overall, this work argues for an alternative approach to RL research, which we believe could significantly improve real-world RL adoption and help democratize it further.
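
The abstract does not detail the proposed algorithm, but the setting it describes, transferring a sub-optimal teacher policy into a standalone value-based student, can be illustrated with a minimal sketch. The PyTorch snippet below combines a standard one-step TD loss with a distillation term that pulls the student's Q-value-induced policy toward the teacher's action distribution; every name here (`reincarnation_loss`, `distill_weight`, `temperature`) is an illustrative assumption, not the paper's method.

```python
import torch
import torch.nn.functional as F

def reincarnation_loss(student_q, target_q, teacher_probs,
                       obs, actions, rewards, next_obs, dones,
                       gamma=0.99, distill_weight=1.0, temperature=1.0):
    """Hypothetical loss: one-step TD error plus distillation from a
    (possibly sub-optimal) teacher policy's action distribution."""
    q_values = student_q(obs)  # [batch, num_actions]

    # Standard DQN-style TD loss on replay-buffer transitions.
    q_taken = q_values.gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_q = target_q(next_obs).max(dim=1).values
        td_target = rewards + gamma * (1.0 - dones) * next_q
    td_loss = F.smooth_l1_loss(q_taken, td_target)

    # Distillation: cross-entropy between the teacher's action
    # distribution and the softmax policy induced by the student's
    # Q-values, so the student starts near the teacher's behavior.
    student_log_probs = F.log_softmax(q_values / temperature, dim=1)
    distill_loss = -(teacher_probs * student_log_probs).sum(dim=1).mean()

    return td_loss + distill_weight * distill_loss
```

In such a scheme, `distill_weight` would plausibly be annealed toward zero once the student matches the teacher's returns, so that the student is anchored to the sub-optimal policy only early in training rather than permanently.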
