Adaptive Behavior Cloning Regularization for Stable Offline-to-Online Reinforcement Learning

Offline reinforcement learning, by learning from a fixed dataset, makes it possible to learn agent behaviors without interacting with the environment. However, depending on the quality of the offline dataset, such pre-trained agents may have limited performance and may need to be further fine-tuned online by interacting with the environment. During online fine-tuning, the performance of the pre-trained agent may collapse quickly due to the sudden distribution shift from offline to online data. While constraints enforced by offline RL methods, such as a behavior cloning loss, prevent this to some extent, they also significantly slow down online fine-tuning by forcing the agent to stay close to the behavior policy. We propose to adaptively weight the behavior cloning loss during online fine-tuning based on the agent's performance and training stability. Moreover, we use a randomized ensemble of Q functions to further increase the sample efficiency of online fine-tuning by performing a large number of learning updates. Experiments show that the proposed method yields state-of-the-art offline-to-online reinforcement learning performance on the popular D4RL benchmark. Code is available at \url{https://github.com/zhaoyi11/adaptive_bc}.
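
To make the two ingredients concrete, below is a minimal sketch of (i) an adaptively weighted behavior cloning term in a TD3+BC-style actor objective and (ii) a REDQ-style target computed as the minimum over a random subset of an ensemble of Q functions. The decay rule, the class name `AdaptiveBCWeight`, and the helper `ensemble_target` are illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch: adaptive BC regularization + randomized Q-ensemble target.
# The specific weight schedule below is an assumption for illustration only.
import torch
import torch.nn.functional as F


class AdaptiveBCWeight:
    """Scales the BC coefficient down as online returns improve and stay stable."""

    def __init__(self, init_weight=1.0, min_weight=0.0, momentum=0.95):
        self.weight = init_weight
        self.min_weight = min_weight
        self.momentum = momentum
        self.smoothed_return = None

    def update(self, episode_return, target_return):
        # Exponential moving average of returns as a crude performance/stability signal.
        if self.smoothed_return is None:
            self.smoothed_return = episode_return
        else:
            self.smoothed_return = (
                self.momentum * self.smoothed_return
                + (1.0 - self.momentum) * episode_return
            )
        # Relax the constraint as the agent approaches the target return;
        # the weight tightens again automatically if performance drops.
        progress = min(max(self.smoothed_return / target_return, 0.0), 1.0)
        self.weight = max(self.min_weight, 1.0 - progress)
        return self.weight


def actor_loss(q_values, policy_actions, dataset_actions, bc_weight):
    """TD3+BC-style objective: maximize Q while staying close to the dataset actions."""
    # Normalize the Q term so the BC weight has a consistent scale (as in TD3+BC).
    lam = 1.0 / q_values.abs().mean().detach()
    return -(lam * q_values).mean() + bc_weight * F.mse_loss(
        policy_actions, dataset_actions
    )


def ensemble_target(q_targets, subset_size=2):
    """REDQ-style target: minimum over a random subset of ensemble Q-estimates.

    q_targets: tensor of shape (num_ensemble, batch_size) with target Q-values.
    """
    idx = torch.randperm(q_targets.shape[0])[:subset_size]
    return q_targets[idx].min(dim=0).values
```

In this sketch the weight would be updated once per evaluation episode (e.g. `bc_weight = controller.update(episode_return, target_return)`) and then passed to `actor_loss` for every gradient step, while `ensemble_target` would be used inside the critic update to compute bootstrapped targets from the Q ensemble.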
