Bridging Offline Reinforcement Learning and Imitation Learning: A Tale of Pessimism

Offline reinforcement learning (RL) algorithms seek to learn an optimal policy from a fixed dataset without active data collection. Depending on the composition of the offline dataset, two main families of methods are used: imitation learning, which is suited to expert datasets, and vanilla offline RL, which often requires datasets with uniform coverage. From a practical standpoint, datasets often deviate from these two extremes, and the exact data composition is usually unknown. To bridge this gap, we present a new offline RL framework, called single-policy concentrability, that smoothly interpolates between the two extremes of data composition, hence unifying imitation learning and vanilla offline RL. Under this new framework, we ask: can one develop an algorithm that achieves a minimax optimal rate adaptive to the unknown data composition? To address this question, we consider a lower confidence bound (LCB) algorithm built on the principle of pessimism in the face of uncertainty in offline RL. We study finite-sample properties of LCB as well as information-theoretic limits in multi-armed bandits, contextual bandits, and Markov decision processes (MDPs). Our analysis reveals surprising facts about the optimal rates. In particular, in both contextual bandits and RL, LCB achieves a fast convergence rate on nearly-expert datasets, analogous to the one achieved by imitation learning and in contrast to the slow rate of vanilla offline RL. In contextual bandits, we prove that LCB is adaptively optimal over the entire range of data compositions, achieving a smooth transition from imitation learning to offline RL. We further show that LCB is nearly adaptively optimal in MDPs.
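
To make the pessimism principle concrete, the following is a minimal sketch of an LCB rule for the simplest setting mentioned in the abstract, offline multi-armed bandits: estimate each arm's mean reward from the fixed dataset, subtract a confidence width that shrinks with the number of samples, and select the arm with the highest resulting lower bound. The Hoeffding-style width, the constants, the handling of arms with no data, and the name lcb_bandit are illustrative assumptions, not the paper's exact construction.

import numpy as np

def lcb_bandit(rewards_by_arm, delta=0.1):
    """Pessimistic (LCB) arm selection from a fixed offline dataset.

    rewards_by_arm: list of reward samples (assumed in [0, 1]) per arm.
    Returns the index of the arm maximizing the lower confidence bound.
    """
    n_arms = len(rewards_by_arm)
    lcb = np.full(n_arms, -np.inf)  # arms with no data are never selected
    for a, samples in enumerate(rewards_by_arm):
        n = len(samples)
        if n == 0:
            continue
        mean = np.mean(samples)
        # Hoeffding-style width for bounded rewards; shrinks as n grows.
        width = np.sqrt(np.log(2 * n_arms / delta) / (2 * n))
        lcb[a] = mean - width
    return int(np.argmax(lcb))

# Example: the second arm looks better empirically but has only 3 samples,
# so pessimism discounts it heavily and the well-sampled first arm is
# (typically) chosen.
rng = np.random.default_rng(0)
data = [list(rng.binomial(1, 0.6, size=200)),
        list(rng.binomial(1, 0.9, size=3))]
print(lcb_bandit(data))

This behavior illustrates the intuition behind the adaptivity result: when the dataset is nearly expert, most samples concentrate on good actions, so the penalty barely affects them, while poorly covered actions are ruled out rather than overestimated.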
