Accountable Off-Policy Evaluation With Kernel Bellman Statistics

We consider off-policy evaluation (OPE), which evaluates the performance of a new policy using observed data collected from previous experiments, without executing the new policy. OPE has important applications in domains where execution is costly or raises safety concerns, such as medical diagnosis, recommendation systems, and robotics. In practice, because off-policy data carry limited information, it is highly desirable to construct rigorous confidence intervals for the policy performance, not just point estimates. In this work, we propose a new variational framework that reduces the calculation of tight confidence bounds in OPE to an optimization problem over a feasible set that contains the true state-action value function with high probability. The feasible set is constructed by leveraging statistical properties of the recently proposed kernel Bellman loss (Feng et al., 2019). We design an efficient computational approach for calculating our bounds, and extend it to perform post-hoc diagnosis and correction of existing estimators. Empirical results show that our method yields tight confidence intervals across a range of settings.
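
To make the framework concrete, below is a minimal sketch of the two steps the abstract describes: forming a feasible set from the empirical kernel Bellman loss, and optimizing the policy-value estimate over that set to obtain an interval. It is an illustration under simplifying assumptions, not the paper's exact algorithm: it assumes a linear value class q(s, a) = phi(s, a)^T w (e.g., built from random features [8]), a Gaussian kernel over state-action pairs, and CVXPY [22] as the convex solver. The inputs phi_sa, phi_next_pi, phi_init_pi, rewards, Z, and the threshold eps are hypothetical names for quantities computed from the off-policy transitions (s, a, r, s').

    import numpy as np
    import cvxpy as cp

    def gaussian_gram(Z, bandwidth=1.0):
        # Gram matrix K[i, j] = exp(-||z_i - z_j||^2 / (2 * bandwidth^2))
        # over the n observed state-action pairs stacked row-wise in Z.
        sq = np.sum(Z ** 2, axis=1)
        d2 = sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T
        return np.exp(-np.maximum(d2, 0.0) / (2.0 * bandwidth ** 2))

    def value_interval(phi_sa, phi_next_pi, phi_init_pi, rewards, Z,
                       gamma=0.99, eps=0.1, radius=100.0):
        # For a linear class q(s, a) = phi(s, a)^T w, the Bellman residual on
        # transition i is r_i + gamma * E_{a'~pi}[q(s'_i, a')] - q(s_i, a_i)
        # = rewards[i] - (B @ w)[i], so the empirical kernel Bellman loss
        # (1 / n^2) * resid^T K resid is a convex quadratic in w.
        n, d = phi_sa.shape
        K = gaussian_gram(Z)
        L = np.linalg.cholesky(K + 1e-8 * np.eye(n))  # K ~= L @ L.T
        B = phi_sa - gamma * phi_next_pi
        c = phi_init_pi.mean(axis=0)  # plug-in estimate of E[q(s0, pi)]

        bounds = []
        for sense in (cp.Minimize, cp.Maximize):
            w = cp.Variable(d)
            resid = rewards - B @ w
            loss = cp.sum_squares(L.T @ resid) / n ** 2
            # Feasible set: norm-bounded q whose empirical kernel Bellman
            # loss stays below the concentration threshold eps, so that the
            # set traps the true value function with high probability.
            constraints = [loss <= eps, cp.norm(w, 2) <= radius]
            problem = cp.Problem(sense(c @ w), constraints)
            problem.solve()
            bounds.append(problem.value)
        return bounds  # [lower, upper] interval for the target policy value

Here phi_next_pi stacks the expected next-state features E_{a'~pi}[phi(s', a')] and phi_init_pi the features of initial state-action pairs drawn under the target policy. In the paper's framework the threshold eps is not a free knob: it comes from a high-probability upper bound on the kernel Bellman statistic (e.g., Hoeffding-type [17] or U-statistic [44] concentration), which is what makes the resulting interval valid at the stated confidence level.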

[1] Philip S. Thomas et al. High Confidence Policy Improvement. ICML, 2015.

[2] Yifei Ma et al. Optimal Off-Policy Evaluation for Reinforcement Learning with Marginalized Importance Sampling. NeurIPS, 2019.

[3] Wojciech Zaremba et al. OpenAI Gym. arXiv, 2016.

[4] Philip S. Thomas et al. High-Confidence Off-Policy Evaluation. AAAI, 2015.

[5] Qiang Liu et al. Doubly Robust Bias Reduction in Infinite Horizon Off-Policy Estimation. ICLR, 2020.

[6] Mehrdad Farajtabar et al. More Robust Doubly Robust Off-policy Evaluation. ICML, 2018.

[7] Qiang Liu et al. Breaking the Curse of Horizon: Infinite-Horizon Off-Policy Estimation. NeurIPS, 2018.

[8] Benjamin Recht et al. Random Features for Large-Scale Kernel Machines. NIPS, 2007.

[9] Bo Dai et al. GenDICE: Generalized Offline Estimation of Stationary Values. ICLR, 2020.

[10] G. Imbens et al. Efficient Estimation of Average Treatment Effects Using the Estimated Propensity Score. 2002.

[11] Predrag Klasnja et al. Off-Policy Estimation of Long-Term Average Outcomes with Applications to Mobile Health. arXiv, 2019.

[12] J. M. Robins et al. Marginal Mean Models for Dynamic Regimes. Journal of the American Statistical Association, 2001.

[13] Shie Mannor et al. Regularized Policy Iteration with Nonparametric Function Spaces. Journal of Machine Learning Research, 2016.

[14] Joaquin Quiñonero Candela et al. Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising. Journal of Machine Learning Research, 2013.

[15] Qiang Liu et al. Black-box Off-policy Estimation for Infinite-Horizon Reinforcement Learning. ICLR, 2020.

[16] Philip S. Thomas et al. Predictive Off-Policy Policy Evaluation for Nonstationary Decision Problems, with Applications to Digital Marketing. AAAI, 2017.

[17] W. Hoeffding. Probability Inequalities for Sums of Bounded Random Variables. Journal of the American Statistical Association, 1963.

[18] Masatoshi Uehara et al. Minimax Weight and Q-Function Learning for Off-Policy Evaluation. ICML, 2020.

[19] Xin Xu et al. Kernel-Based Least Squares Policy Iteration for Reinforcement Learning. IEEE Transactions on Neural Networks, 2007.

[20] Bo Dai et al. DualDICE: Behavior-Agnostic Estimation of Discounted Stationary Distribution Corrections. NeurIPS, 2019.

[21] John Langford et al. Doubly Robust Policy Evaluation and Learning. ICML, 2011.

[22] Stephen P. Boyd et al. CVXPY: A Python-Embedded Modeling Language for Convex Optimization. Journal of Machine Learning Research, 2016.

[23] Masatoshi Uehara et al. Double Reinforcement Learning for Efficient and Robust Off-Policy Evaluation. ICML, 2020.

[24] Martha White et al. Interval Estimation for Reinforcement-Learning Algorithms in Continuous-State Domains. NIPS, 2010.

[25] Nan Jiang et al. Doubly Robust Off-Policy Value Evaluation for Reinforcement Learning. ICML, 2016.

[26] Justin A. Boyan. Least-Squares Temporal Difference Learning. ICML, 1999.

[27] Xin Xu et al. Kernel Least-Squares Temporal Difference Learning. 2006.

[28] Michael I. Jordan et al. Provably Efficient Reinforcement Learning with Linear Function Approximation. COLT, 2020.

[29] Wei Chu et al. Unbiased Offline Evaluation of Contextual-Bandit-Based News Article Recommendation Algorithms. WSDM, 2011.

[30] Gavin Taylor et al. Kernelized Value Function Approximation for Reinforcement Learning. ICML, 2009.

[31] Stephen Boyd et al. A Rewriting System for Convex Optimization Problems. arXiv, 2017.

[32] Leemon C. Baird. Residual Algorithms: Reinforcement Learning with Function Approximation. ICML, 1995.

[33] Peter Stone et al. Bootstrapping with Models: Confidence Intervals for Off-Policy Evaluation. AAAI, 2016.

[34] T. Jung et al. Kernelizing LSPE(λ). IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning (ADPRL), 2007.

[35] Michael I. Jordan et al. Is Q-learning Provably Efficient? NeurIPS, 2018.

[36] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, 1994.

[37] Mengdi Wang et al. Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation. ICML, 2020.

[38] Guy Lever et al. Modelling Transition Dynamics in MDPs with RKHS Embeddings. ICML, 2012.

[39] Qiang Liu et al. A Kernel Loss for Solving the Bellman Equation. NeurIPS, 2019.

[40] Miroslav Dudík et al. Optimal and Adaptive Off-Policy Evaluation in Contextual Bandits. ICML, 2017.

[41] Philip S. Thomas et al. Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning. ICML, 2016.

[42] Jiawei Huang et al. Minimax Confidence Interval for Off-Policy Evaluation and Policy Optimization. arXiv, 2020.

[43] Lihong Li et al. Policy Certificates: Towards Accountable Reinforcement Learning. ICML, 2019.

[44] Fang Han. An Exponential Inequality for U-Statistics Under Mixing Conditions. Journal of Theoretical Probability, 2016.

[45] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.