Off-Policy Interval Estimation with Lipschitz Value Iteration

Off-policy evaluation provides an essential tool for evaluating the effects of different policies or treatments using only observed data. When applied to high-stakes scenarios such as medical diagnosis or financial decision-making, it is crucial to provide end users with provably correct upper and lower bounds on the expected reward, rather than only a classical single point estimate, since executing a poor policy can be very costly. In this work, we propose a provably correct method for obtaining interval bounds for off-policy evaluation in a general continuous setting. The idea is to search for the maximum and minimum values of the expected reward among all Lipschitz Q-functions that are consistent with the observations, which amounts to solving a constrained optimization problem over a Lipschitz function space. We then introduce a Lipschitz value iteration method that monotonically tightens the interval; it is simple yet efficient and provably convergent. We demonstrate the practical efficiency of our method on a range of benchmarks.
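
The sketch below is a minimal illustration of how such an upper bound could be tightened iteratively; it is not the paper's exact algorithm. It assumes, purely for illustration, a dataset of transitions encoded as state-action features with the target policy's action folded in, Euclidean distance as the metric, and a known Lipschitz constant `lip`; the function names `mcshane_upper_extension` and `lipschitz_upper_value_iteration` are hypothetical. The backup propagates values through the smallest Lipschitz function lying above the current point estimates (a McShane-type upper extension), so the bound can only decrease across iterations; the lower bound would be symmetric, using the largest Lipschitz function below the points and a pessimistic initialization.

```python
import numpy as np

def mcshane_upper_extension(x_query, xs, qs, lip):
    """Smallest lip-Lipschitz function lying above the point values (xs, qs),
    evaluated at x_query: min_i q_i + lip * ||x_query - x_i||."""
    dists = np.linalg.norm(xs - x_query, axis=1)
    return np.min(qs + lip * dists)

def lipschitz_upper_value_iteration(sa, rewards, next_sa, init_sa,
                                    lip, gamma, r_max, n_iters=200):
    """Illustrative sketch: iteratively tighten an upper bound on the
    target policy's expected return.

    sa      : (n, d) observed state-action features
    rewards : (n,)   observed rewards
    next_sa : (n, d) next state paired with the target policy's action
    init_sa : (m, d) initial state paired with the target policy's action
    """
    n = len(rewards)
    # Optimistic initialization: no Q-value can exceed r_max / (1 - gamma).
    q_upper = np.full(n, r_max / (1.0 - gamma))
    for _ in range(n_iters):
        # Bellman-style backup through the Lipschitz upper extension
        # evaluated at the next state-action points.
        backed_up = np.array([
            rewards[i] + gamma * mcshane_upper_extension(next_sa[i], sa, q_upper, lip)
            for i in range(n)
        ])
        # Monotone tightening: keep the smaller of old and backed-up values.
        q_upper = np.minimum(q_upper, backed_up)
    # Upper bound on the expected return under the initial distribution.
    return float(np.mean([mcshane_upper_extension(x0, sa, q_upper, lip)
                          for x0 in init_sa]))
```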
