On Minimax Optimal Offline Policy Evaluation

This paper studies the off-policy evaluation problem, where one aims to estimate the value of a target policy from a sample of observations collected under another policy. We first consider the multi-armed bandit case, establish a minimax risk lower bound, and analyze the risk of two standard estimators. It is shown, and verified in simulation, that one is minimax optimal up to a constant, while the other can be arbitrarily worse, despite its empirical success and popularity. The results are applied to related problems in contextual bandits and fixed-horizon Markov decision processes, and are also connected to semi-supervised learning.
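The abstract does not name the two standard estimators, so the following sketch is illustrative only: it assumes the comparison is between the importance-sampling (propensity-weighted) estimator and the regression (plug-in) estimator, two common choices for off-policy value estimation in a K-armed bandit with a known logging policy. The policies, data, and function names are hypothetical, not taken from the paper.

```python
import numpy as np

# Illustrative sketch (assumed setting, not the paper's exact method):
# actions a_i are drawn from a known behavior policy pi_b, rewards r_i are
# observed, and we estimate v(pi_t) = sum_a pi_t(a) * E[r | a] for a target
# policy pi_t.

def importance_sampling_estimate(actions, rewards, pi_b, pi_t):
    """Reweight each observed reward by the likelihood ratio pi_t(a)/pi_b(a)."""
    weights = pi_t[actions] / pi_b[actions]
    return np.mean(weights * rewards)

def regression_estimate(actions, rewards, pi_t, num_arms):
    """Plug-in estimate: average reward per arm, then weight by pi_t."""
    value = 0.0
    for a in range(num_arms):
        pulled = actions == a
        if pulled.any():  # arms never pulled contribute nothing in this sketch
            value += pi_t[a] * rewards[pulled].mean()
    return value

# Example usage with synthetic data.
rng = np.random.default_rng(0)
num_arms, n = 3, 10_000
pi_b = np.array([0.6, 0.3, 0.1])         # behavior (logging) policy
pi_t = np.array([0.1, 0.2, 0.7])         # target policy to evaluate
mean_reward = np.array([0.2, 0.5, 0.8])  # true per-arm mean rewards

actions = rng.choice(num_arms, size=n, p=pi_b)
rewards = rng.binomial(1, mean_reward[actions]).astype(float)

print("true value          :", pi_t @ mean_reward)
print("importance sampling :", importance_sampling_estimate(actions, rewards, pi_b, pi_t))
print("regression plug-in  :", regression_estimate(actions, rewards, pi_t, num_arms))
```

In this setup the importance-sampling estimate has high variance whenever the target policy places much probability on arms the logging policy rarely chooses, which is the kind of gap the abstract's "arbitrarily worse" comparison concerns.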
