Doubly Robust Off-policy Evaluation for Reinforcement Learning

We study the problem of evaluating a policy that is different from the one that generates data. Such a problem, known as off-policy evaluation in reinforcement learning (RL), is encountered whenever one wants to estimate the value of a new solution, based on historical data, before actually deploying it in the real system, which is a critical step of applying RL in most real-world applications. Despite the fundamental importance of the problem, existing general methods either have uncontrolled bias or suffer high variance. In this work, we extend the so-called doubly robust estimator for bandits to sequential decision-making problems, which gets the best of both worlds: it is guaranteed to be unbiased and has low variance, and as a point estimator, it outperforms the most popular importance-sampling estimator and its variants in most occasions. We also provide theoretical results on the hardness of the problem, and show that our estimator can match the asymptotic lower bound in certain scenarios.
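The abstract does not spell out the estimator itself. As a rough illustration, the following is a minimal sketch of a per-trajectory doubly robust recursion for sequential problems, which combines an approximate value model with a step-wise importance weight. All names here (`doubly_robust_value`, `pi_e`, `q_hat`, `n_actions`, the layout of each trajectory step) are hypothetical and chosen for illustration. The sketch assumes finite action sets, known behavior-policy probabilities for the logged actions, and a fixed approximate Q-function.

```python
from typing import Callable, Sequence, Tuple

# One logged step: (state, action, reward, probability of that action
# under the behavior policy that generated the data).
Step = Tuple[int, int, float, float]


def doubly_robust_value(
    trajectory: Sequence[Step],
    pi_e: Callable[[int, int], float],   # evaluation policy: pi_e(s, a) -> probability
    q_hat: Callable[[int, int], float],  # approximate Q-function (assumed given)
    n_actions: int,
    gamma: float = 1.0,
) -> float:
    """Doubly robust value estimate from a single trajectory (illustrative sketch)."""
    v_dr = 0.0  # estimate for the empty remainder of the trajectory
    # Work backwards so each step can reuse the estimate of the steps after it.
    for state, action, reward, mu_prob in reversed(trajectory):
        # Model-based state value under the evaluation policy.
        v_hat = sum(pi_e(state, a) * q_hat(state, a) for a in range(n_actions))
        # Single-step importance weight for the logged action.
        rho = pi_e(state, action) / mu_prob
        # Model prediction plus an importance-weighted correction of its error.
        v_dr = v_hat + rho * (reward + gamma * v_dr - q_hat(state, action))
    return v_dr
```

Averaging this quantity over independently logged trajectories gives the overall estimate. When `q_hat` is identically zero, the recursion reduces to step-wise importance sampling; a better `q_hat` shrinks the correction term, which is where a variance reduction would be expected to come from.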

Nan Jiang | Lihong Li

[1] C. Glymour, et al. Statistics and Causal Inference, 1985.

[2] J. Robins, et al. Semiparametric regression estimation in the presence of dependent censoring, 1995.

[3] John N. Tsitsiklis, et al. Neuro-Dynamic Programming, 1996, Encyclopedia of Machine Learning.

[4] Doina Precup, et al. Intra-Option Learning about Temporally Abstract Actions, 1998, ICML.

[5] G. Imbens, et al. Efficient Estimation of Average Treatment Effects Using the Estimated Propensity Score, 2000.

[6] J. Pearl. Causality: Models, Reasoning and Inference, 2000.

[7] Doina Precup, et al. Eligibility Traces for Off-Policy Policy Evaluation, 2000, ICML.

[8] Doina Precup, et al. Temporal abstraction in reinforcement learning, 2000, ICML 2000.

[9] J. M. Robins, et al. Marginal Mean Models for Dynamic Regimes, 2001, Journal of the American Statistical Association.

[10] Sanjoy Dasgupta, et al. Off-Policy Temporal Difference Learning with Function Approximation, 2001, ICML.

[11] G. Imbens, et al. Efficient Estimation of Average Treatment Effects Using the Estimated Propensity Score, 2002.

[12] John Langford, et al. Approximately Optimal Approximate Reinforcement Learning, 2002, ICML.

[13] Naoki Abe, et al. Sequential cost-sensitive decision making with reinforcement learning, 2002, KDD.

[14] Richard S. Sutton, et al. Reinforcement learning with replacing eligibility traces, 2004, Machine Learning.

[15] Richard S. Sutton, et al. Reinforcement Learning: An Introduction, 1998, IEEE Trans. Neural Networks.

[16] John N. Tsitsiklis, et al. Bias and Variance Approximation in Value Function Estimates, 2007, Manag. Sci.

[17] Peter Stone, et al. Model-based function approximation in reinforcement learning, 2007, AAMAS '07.

[18] T. Moore. A Theory of Cramer-Rao Bounds for Constrained Parametric Models, 2010.

[19] John Langford, et al. Doubly Robust Policy Evaluation and Learning, 2011, ICML.

[20] Wei Chu, et al. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms, 2010, WSDM '11.

[21] Guy Lever, et al. Modelling transition dynamics in MDPs with RKHS embeddings, 2012, ICML.

[22] Louis Wehenkel, et al. Batch mode reinforcement learning based on the synthesis of artificial trajectories, 2013, Ann. Oper. Res.

[23] Joaquin Quiñonero Candela, et al. Counterfactual reasoning and learning systems: the example of computational advertising, 2013, J. Mach. Learn. Res.

[24] Daniele Calandriello, et al. Safe Policy Iteration, 2013, ICML.

[25] Jan Peters, et al. Policy evaluation with temporal differences: a survey and comparison, 2015, J. Mach. Learn. Res.

[26] Philip S. Thomas, et al. High Confidence Policy Improvement, 2015, ICML.

[27] Philip S. Thomas, et al. High-Confidence Off-Policy Evaluation, 2015, AAAI.

[28] Philip S. Thomas, et al. Safe Reinforcement Learning, 2015.

[29] Martha White, et al. An Emphatic Approach to the Problem of Off-policy Temporal-Difference Learning, 2015, J. Mach. Learn. Res.

[30] Dirk Ormoneit, et al. Kernel-Based Reinforcement Learning, 2017, Encyclopedia of Machine Learning and Data Mining.