Near-Optimal Provable Uniform Convergence in Off-Policy Evaluation for Reinforcement Learning

Off-policy evaluation (OPE) aims to estimate the performance of a target policy $\pi$ using offline data collected by a logging policy $\mu$. OPE has been studied intensively, and the recent marginalized importance sampling (MIS) approach achieves sample-efficient estimation. However, it remains largely unknown whether uniform convergence guarantees in OPE can be obtained efficiently. In this paper, we study this new question and, for the first time, reveal a comprehensive relationship between OPE and offline learning. For the global policy class, using the fully model-based OPE estimator, our best result achieves $\epsilon$-uniform convergence with complexity $\widetilde{O}\big(H^3\cdot\min(S,H)/(d_m\epsilon^2)\big)$, where $S$ is the number of states, $H$ is the horizon, and $d_m$ is an instance-dependent quantity determined by $\mu$. This result is within a single factor of $\min(S,H)$ of our uniform convergence lower bound, up to logarithmic factors. For the local policy class, $\epsilon$-uniform convergence is achieved with the optimal complexity $\widetilde{O}\big(H^3/(d_m\epsilon^2)\big)$ in the off-policy setting. This result complements the work on sparse model-based planning with a generative model (Agarwal et al. 2019). Lastly, an interesting corollary of our intermediate result yields a refined analysis of the simulation lemma.
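The fully model-based OPE estimator referred to above is the plug-in approach: build an empirical MDP from the offline data and evaluate the target policy inside it by backward induction. Below is a minimal sketch of that idea for a finite-horizon tabular MDP; the trajectory format, the function name `model_based_ope`, and the zero-reward self-loop convention for unvisited state-action pairs are illustrative assumptions, not the paper's exact construction. Uniform convergence then asks that this estimate be simultaneously accurate for every policy in the class, which is what connects OPE to offline learning.

```python
import numpy as np

def model_based_ope(episodes, pi, S, A, H):
    """Plug-in (model-based) OPE sketch for a finite-horizon tabular MDP.

    episodes : list of trajectories, each a list of (h, s, a, r, s_next) tuples
               logged under the behavior policy mu (assumed data format).
    pi       : target policy, array of shape (H, S, A) with pi[h, s, a] = pi_h(a|s).
    S, A, H  : numbers of states, actions, and the horizon.

    Returns the estimated value of pi under the empirical initial-state distribution.
    """
    counts = np.zeros((H, S, A))          # visit counts per (h, s, a)
    trans = np.zeros((H, S, A, S))        # transition counts
    rew = np.zeros((H, S, A))             # cumulative observed rewards
    init_counts = np.zeros(S)             # initial-state counts

    for traj in episodes:
        init_counts[traj[0][1]] += 1
        for (h, s, a, r, s_next) in traj:
            counts[h, s, a] += 1
            trans[h, s, a, s_next] += 1
            rew[h, s, a] += r

    # Build the empirical MDP; unvisited (h, s, a) pairs default to a
    # zero-reward absorbing self-loop (one common convention).
    P_hat = np.zeros((H, S, A, S))
    r_hat = np.zeros((H, S, A))
    for h in range(H):
        for s in range(S):
            for a in range(A):
                n = counts[h, s, a]
                if n > 0:
                    P_hat[h, s, a] = trans[h, s, a] / n
                    r_hat[h, s, a] = rew[h, s, a] / n
                else:
                    P_hat[h, s, a, s] = 1.0

    # Evaluate pi in the empirical MDP by backward induction.
    V = np.zeros(S)                                # V_H = 0
    for h in reversed(range(H)):
        Q = r_hat[h] + P_hat[h] @ V                # Q_h(s, a), shape (S, A)
        V = np.sum(pi[h] * Q, axis=1)              # V_h(s) = sum_a pi_h(a|s) Q_h(s, a)

    d1_hat = init_counts / max(init_counts.sum(), 1)
    return float(d1_hat @ V)
```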

[1]  Yu-Xiang Wang,et al.  Asymptotically Efficient Off-Policy Evaluation for Tabular Reinforcement Learning , 2020, AISTATS.

[2]  Doina Precup,et al.  Eligibility Traces for Off-Policy Policy Evaluation , 2000, ICML.

[3]  Nan Jiang,et al.  Contextual Decision Processes with low Bellman rank are PAC-Learnable , 2016, ICML.

[4]  Masatoshi Uehara,et al.  Double Reinforcement Learning for Efficient Off-Policy Evaluation in Markov Decision Processes , 2019, J. Mach. Learn. Res..

[5]  Philip S. Thomas,et al.  Predictive Off-Policy Policy Evaluation for Nonstationary Decision Problems, with Applications to Digital Marketing , 2017, AAAI.

[6]  Natasha Jaques,et al.  Way Off-Policy Batch Deep Reinforcement Learning of Implicit Human Preferences in Dialog , 2019, ArXiv.

[7]  Nan Jiang,et al.  Information-Theoretic Considerations in Batch Reinforcement Learning , 2019, ICML.

[8]  Fredrik D. Johansson,et al.  Guidelines for reinforcement learning in healthcare , 2019, Nature Medicine.

[9]  Miroslav Dudík,et al.  Optimal and Adaptive Off-policy Evaluation in Contextual Bandits , 2016, ICML.

[10]  Michael Kearns,et al.  Near-Optimal Reinforcement Learning in Polynomial Time , 2002, Machine Learning.

[11]  Masatoshi Uehara,et al.  Efficiently Breaking the Curse of Horizon in Off-Policy Evaluation with Double Reinforcement Learning , 2019 .

[12]  D. Horvitz,et al.  A Generalization of Sampling Without Replacement from a Finite Universe , 1952 .

[13]  Wei Chu,et al.  Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms , 2010, WSDM '11.

[14]  M. J. D. Powell,et al.  Weighted Uniform Sampling — a Monte Carlo Technique for Reducing Variance , 1966 .

[15]  Philip S. Thomas,et al.  Safe Reinforcement Learning , 2015 .

[16]  Sergey Levine,et al.  RoboNet: Large-Scale Multi-Robot Learning , 2019, CoRL.

[17]  Peter Szolovits,et al.  Continuous State-Space Models for Optimal Sepsis Treatment: a Deep Reinforcement Learning Approach , 2017, MLHC.

[18]  Martin A. Riedmiller,et al.  Batch Reinforcement Learning , 2012, Reinforcement Learning.

[19]  Liang Tang,et al.  Automatic ad format selection via contextual bandits , 2013, CIKM.

[20]  Masatoshi Uehara,et al.  Efficiently Breaking the Curse of Horizon: Double Reinforcement Learning in Infinite-Horizon Processes , 2019, ArXiv.

[21]  Philip S. Thomas,et al.  Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning , 2016, ICML.

[22]  H. Chernoff A Measure of Asymptotic Efficiency for Tests of a Hypothesis Based on the sum of Observations , 1952 .

[23]  Ronen I. Brafman,et al.  R-MAX - A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning , 2001, J. Mach. Learn. Res..

[24]  J. Tropp  Freedman's Inequality for Matrix Martingales , 2011, arXiv:1101.3039.

[25]  John Langford,et al.  PAC Reinforcement Learning with Rich Observations , 2016, NIPS.

[26]  Ohad Shamir,et al.  Learnability, Stability and Uniform Convergence , 2010, J. Mach. Learn. Res..

[27]  Yinyu Ye,et al.  The Simplex and Policy-Iteration Methods Are Strongly Polynomial for the Markov Decision Problem with a Fixed Discount Rate , 2011, Math. Oper. Res..

[28]  Yifei Ma,et al.  Optimal Off-Policy Evaluation for Reinforcement Learning with Marginalized Importance Sampling , 2019, NeurIPS.

[29]  Emma Brunskill,et al.  Off-Policy Policy Gradient with State Distribution Correction , 2019, UAI 2019.

[30]  Lihong Li,et al.  Toward Minimax Off-policy Value Estimation , 2015, AISTATS.

[31]  Marco Corazza,et al.  Testing different Reinforcement Learning configurations for financial trading: Introduction and applications , 2018 .

[32]  Richard S. Sutton,et al.  Reinforcement Learning: An Introduction , 1998, IEEE Trans. Neural Networks.

[33]  Masatoshi Uehara,et al.  Minimax Weight and Q-Function Learning for Off-Policy Evaluation , 2019, ICML.

[34]  Michael I. Jordan,et al.  Provably Efficient Reinforcement Learning with Linear Function Approximation , 2019, COLT.

[35]  Sergey Levine,et al.  Deep Reinforcement Learning for Vision-Based Robotic Grasping: A Simulated Comparative Evaluation of Off-Policy Methods , 2018, 2018 IEEE International Conference on Robotics and Automation (ICRA).

[36]  Qiang Liu,et al.  Breaking the Curse of Horizon: Infinite-Horizon Off-Policy Estimation , 2018, NeurIPS.

[37]  Masatoshi Uehara,et al.  Intrinsically Efficient, Stable, and Bounded Off-Policy Evaluation for Reinforcement Learning , 2019, NeurIPS.

[38]  Fan Chung Graham,et al.  Concentration Inequalities and Martingale Inequalities: A Survey , 2006, Internet Math..

[39]  Michael I. Jordan,et al.  Is Q-learning Provably Efficient? , 2018, NeurIPS.

[40]  Mehrdad Farajtabar,et al.  More Robust Doubly Robust Off-policy Evaluation , 2018, ICML.

[41]  Yisong Yue,et al.  Batch Policy Learning under Constraints , 2019, ICML.

[42]  Nan Jiang,et al.  Doubly Robust Off-policy Value Evaluation for Reinforcement Learning , 2015, ICML.

[43]  Christoph Dann,et al.  Sample Complexity of Episodic Fixed-Horizon Reinforcement Learning , 2015, NIPS.

[44]  Vladimir N. Vapnik,et al.  The Nature of Statistical Learning Theory , 2000, Statistics for Engineering and Information Science.

[45]  Mengdi Wang,et al.  Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation , 2020, ICML.

[46]  Nan Jiang,et al.  $Q^\star$ Approximation Schemes for Batch Reinforcement Learning: A Theoretical Comparison , 2020, arXiv:2003.03924.

[47]  John Langford,et al.  Doubly Robust Policy Evaluation and Learning , 2011, ICML.

[48]  Nan Jiang,et al.  Batch Value-function Approximation with Only Realizability , 2020, ICML.

[49]  Louis Wehenkel,et al.  Clinical data based optimal STI strategies for HIV: a reinforcement learning approach , 2006, Proceedings of the 45th IEEE Conference on Decision and Control.

[50]  Xian Wu,et al.  Near-Optimal Time and Sample Complexities for Solving Markov Decision Processes with a Generative Model , 2018, NeurIPS.

[51]  Yao Liu,et al.  Behaviour Policy Estimation in Off-Policy Policy Evaluation: Calibration Matters , 2018, ArXiv.

[52]  Lin F. Yang,et al.  On the Optimality of Sparse Model-Based Planning for Markov Decision Processes , 2019, ArXiv.

[53]  Lin F. Yang,et al.  Model-Based Reinforcement Learning with a Generative Model is Minimax Optimal , 2019, COLT 2020.

[54]  Sergey Levine,et al.  Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems , 2020, ArXiv.

[55]  Emma Brunskill,et al.  Provably Good Batch Reinforcement Learning Without Great Exploration , 2020, ArXiv.

[56]  Ambuj Tewari,et al.  Reinforcement learning in large or unknown mdps , 2007 .

[57]  Nan Jiang,et al.  Open Problem: The Dependence of Sample Complexity Lower Bounds on Planning Horizon , 2018, COLT.

[58]  Rémi Munos,et al.  Error Bounds for Approximate Policy Iteration , 2003, ICML.

[59]  Joaquin Quiñonero Candela,et al.  Counterfactual reasoning and learning systems: the example of computational advertising , 2013, J. Mach. Learn. Res..