Minimax Off-Policy Evaluation for Multi-Armed Bandits

We study the problem of off-policy evaluation in the multi-armed bandit model with bounded rewards, and develop minimax rate-optimal procedures under three settings. First, when the behavior policy is known, we show that the Switch estimator, a method that alternates between the plug-in and importance sampling estimators, is minimax rate-optimal for all sample sizes. Second, when the behavior policy is unknown, we analyze performance in terms of the competitive ratio, revealing a fundamental gap between the known and unknown behavior policy settings: any estimator must have mean-squared error larger than that of the oracle estimator equipped with knowledge of the behavior policy by a multiplicative factor proportional to the support size of the target policy. Moreover, we demonstrate that the plug-in approach achieves this worst-case competitive ratio up to a logarithmic factor. Third, we initiate the study of the partial-knowledge setting, in which the minimum probability assigned by the behavior policy is assumed known. We show that the plug-in estimator is optimal for relatively large values of this minimum probability, but sub-optimal when the minimum probability is small. To remedy this gap, we propose a new estimator, based on approximation by Chebyshev polynomials, that provably achieves the optimal estimation error. Numerical experiments on both simulated and real data corroborate our theoretical findings.
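For concreteness, the following minimal sketch contrasts the three estimators named above for a K-armed bandit with rewards in [0, 1]. This is our illustration rather than the paper's code: the function names, the zero default for arms never pulled, and the threshold value `tau` are assumptions, and the Switch rule shown (importance sampling on arms whose weight pi(a)/pi_b(a) is at most tau, plug-in on the rest) follows the standard form of that estimator.

```python
import numpy as np

def plug_in(actions, rewards, target, n_arms):
    """Plug-in (direct) estimate: empirical mean reward per arm, weighted
    by the target policy; arms never pulled default to a mean of 0."""
    mu_hat = np.zeros(n_arms)
    for a in range(n_arms):
        pulls = rewards[actions == a]
        if pulls.size > 0:
            mu_hat[a] = pulls.mean()
    return float(target @ mu_hat)

def importance_sampling(actions, rewards, target, behavior):
    """IS estimate: reweight each observed reward by pi(A_i) / pi_b(A_i)."""
    return float(np.mean(target[actions] / behavior[actions] * rewards))

def switch(actions, rewards, target, behavior, tau):
    """Switch estimate: importance sampling on arms whose weight
    pi(a)/pi_b(a) is at most tau, plug-in on the remaining arms."""
    n_arms = len(target)
    ratio = target / behavior            # assumes pi_b has full support
    low = ratio <= tau                   # arms handled by importance sampling
    is_part = np.mean(np.where(low[actions], ratio[actions], 0.0) * rewards)
    mu_hat = np.zeros(n_arms)
    for a in np.flatnonzero(~low):       # plug-in only on high-weight arms
        pulls = rewards[actions == a]
        if pulls.size > 0:
            mu_hat[a] = pulls.mean()
    return float(is_part + target[~low] @ mu_hat[~low])

# Toy check: uniform logging policy, target supported on 4 of 20 arms.
rng = np.random.default_rng(0)
K, n = 20, 500
behavior = np.full(K, 1.0 / K)
target = np.zeros(K); target[:4] = 0.25
mu = rng.uniform(size=K)                 # true mean rewards
actions = rng.choice(K, size=n, p=behavior)
rewards = rng.binomial(1, mu[actions]).astype(float)
print(switch(actions, rewards, target, behavior, tau=5.0), float(target @ mu))
```

In this toy setup every supported arm has weight 0.25 / 0.05 = 5 <= tau, so Switch coincides with importance sampling; shrinking the behavior probabilities on the target's support pushes high-weight arms into the plug-in part instead.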

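The Chebyshev ingredient of the partial-knowledge setting can be previewed in isolation. Low-degree polynomials of an unknown arm probability admit unbiased estimates from observed arm counts, whereas the reciprocal weight 1/pi_b(a) does not, so one approximates p -> 1/p by a polynomial on [p_min, 1]. The sketch below is our generic illustration of that approximation step only (the paper's estimator additionally requires the unbiased moment-estimation step, which is omitted); `reciprocal_poly` and its parameters are hypothetical names.

```python
import numpy as np
from numpy.polynomial import chebyshev as C

def reciprocal_poly(p_min, degree):
    """Degree-`degree` Chebyshev interpolant of f(p) = 1/p on [p_min, 1];
    returns a callable evaluating the polynomial approximation at p."""
    t = np.cos((2.0 * np.arange(degree + 1) + 1.0) * np.pi
               / (2.0 * (degree + 1)))
    p = 0.5 * (t + 1.0) * (1.0 - p_min) + p_min  # nodes mapped to [p_min, 1]
    coef = C.chebfit(t, 1.0 / p, degree)         # interpolate 1/p at the nodes
    return lambda q: C.chebval(2.0 * (q - p_min) / (1.0 - p_min) - 1.0, coef)

# The uniform error decays geometrically in the degree, but the rate
# degrades as p_min -> 0 (the singularity of 1/p approaches the interval),
# matching the regime in which the plug-in estimator becomes sub-optimal.
approx = reciprocal_poly(p_min=0.05, degree=8)
grid = np.linspace(0.05, 1.0, 200)
print(np.max(np.abs(approx(grid) - 1.0 / grid)))  # worst-case error on grid
```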