Behavior Prior Representation Learning for Offline Reinforcement Learning

Offline reinforcement learning (RL) struggles in environments with rich and noisy inputs, where the agent only has access to a fixed dataset and cannot interact with the environment. Past works have addressed this by pre-training state representations and then training a policy on top of them. In this work, we introduce a simple yet effective approach for learning state representations. Our method, Behavior Prior Representation (BPR), learns state representations with an easy-to-integrate objective based on behavior cloning of the dataset: we first learn a state representation by mimicking actions from the dataset, and then train a policy on top of the fixed representation using any off-the-shelf offline RL algorithm. Theoretically, we prove that BPR enjoys performance guarantees when integrated into algorithms that either have policy improvement guarantees (conservative algorithms) or produce lower bounds on the policy value (pessimistic algorithms). Empirically, we show that combining BPR with existing state-of-the-art offline RL algorithms leads to significant improvements across several offline control benchmarks. The code is available at https://github.com/bit1029public/offline_bpr.
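As a rough illustration of the two-stage recipe described above, the PyTorch-style sketch below pre-trains a state encoder with a behavior-cloning loss, freezes it, and then hands the frozen representation to an arbitrary offline RL learner. All class names, network sizes, and the `offline_rl_agent.update` hook are hypothetical placeholders, not the authors' released implementation (see the repository linked above for that).

```python
# Minimal sketch of the two-stage BPR procedure; names and hyperparameters are illustrative.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """State encoder phi(s); its output is the learned representation."""
    def __init__(self, state_dim, repr_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, repr_dim), nn.ReLU(),
        )

    def forward(self, state):
        return self.net(state)

class BCHead(nn.Module):
    """Action predictor used only during the behavior-cloning pre-training stage."""
    def __init__(self, repr_dim, action_dim):
        super().__init__()
        self.net = nn.Linear(repr_dim, action_dim)

    def forward(self, z):
        return self.net(z)

def pretrain_representation(encoder, bc_head, dataset_loader, epochs=10, lr=3e-4):
    """Stage 1: learn phi by regressing dataset actions from states (behavior cloning)."""
    params = list(encoder.parameters()) + list(bc_head.parameters())
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(epochs):
        for states, actions in dataset_loader:
            pred_actions = bc_head(encoder(states))
            loss = ((pred_actions - actions) ** 2).mean()  # MSE BC loss for continuous actions
            opt.zero_grad()
            loss.backward()
            opt.step()
    for p in encoder.parameters():  # freeze the representation before policy training
        p.requires_grad_(False)
    return encoder

def train_policy_on_representation(encoder, offline_rl_agent, dataset_loader, epochs=100):
    """Stage 2: run any off-the-shelf offline RL algorithm on top of the frozen phi(s)."""
    for _ in range(epochs):
        for states, actions, rewards, next_states, dones in dataset_loader:
            z, z_next = encoder(states), encoder(next_states)
            # `offline_rl_agent.update` stands in for one gradient step of e.g. CQL or TD3+BC,
            # operating on representations instead of raw states.
            offline_rl_agent.update(z, actions, rewards, z_next, dones)
    return offline_rl_agent
```

The key design choice is that the encoder is trained only with the behavior-cloning objective and kept fixed afterwards, so the downstream offline RL algorithm is unchanged apart from receiving phi(s) in place of s.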
