What are the Statistical Limits of Offline RL with Linear Function Approximation?

Offline reinforcement learning seeks to use offline (observational) data to learn (causal) sequential decision-making strategies. The hope is that offline reinforcement learning, coupled with function approximation methods (to handle the curse of dimensionality), can help alleviate the excessive sample complexity burden of modern sequential decision-making problems. However, the extent to which this broader approach can be effective is not well understood, and the literature largely consists of sufficient conditions. This work focuses on the basic question of which representational and distributional conditions are necessary for provably sample-efficient offline reinforcement learning. Perhaps surprisingly, our main result shows that even if (i) we have realizability, in that the true value function of \emph{every} policy is linear in a given set of features, and (ii) our off-policy data has good coverage over all features (under a strong spectral condition), then any algorithm still (information-theoretically) requires a number of offline samples that is exponential in the problem horizon in order to non-trivially estimate the value of \emph{any} given policy. Our results highlight that sample-efficient offline policy evaluation is simply not possible unless significantly stronger conditions hold; such conditions include either low distribution shift (where the offline data distribution is close to the distribution of the policy to be evaluated) or significantly stronger representational conditions (beyond realizability).
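
To make the two conditions precise, a minimal formalization sketch follows; the symbols $\phi$, $\theta^{\pi}$, $\mu$, $c$, $d$, and $H$ are illustrative notation assumed here and do not reproduce the paper's exact statement or constants.

\[
\text{(realizability)}\qquad V^{\pi}(s) \;=\; \langle \phi(s),\, \theta^{\pi} \rangle \quad \text{for every policy } \pi \text{ and state } s,
\]
\[
\text{(coverage)}\qquad \sigma_{\min}\!\left( \mathbb{E}_{s \sim \mu}\!\left[ \phi(s)\,\phi(s)^{\top} \right] \right) \;\ge\; \frac{c}{d} \quad \text{for the offline data distribution } \mu,
\]

where $\phi : \mathcal{S} \to \mathbb{R}^{d}$ is the given feature map. Under this reading, the main result says that even when both conditions hold, any estimator of $V^{\pi}$ for a given target policy $\pi$ must use a number of offline samples exponential in the horizon $H$ to achieve non-trivial accuracy.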
