Optimal Off-Policy Evaluation from Multiple Logging Policies

We study off-policy evaluation (OPE) from multiple logging policies, each generating a dataset of fixed size, i.e., stratified sampling. Previous work noted that in this setting the ordering of the variances of different importance sampling estimators is instance-dependent, which raises a dilemma about which importance sampling weights to use. In this paper, we resolve this dilemma by finding the OPE estimator for multiple loggers that has minimum variance for any instance, i.e., the efficient one. In particular, we establish the efficiency bound under stratified sampling and propose an estimator that achieves this bound when given consistent $q$-estimates. To guard against misspecification of $q$-functions, we also provide a way to choose the control variate within a hypothesis class so as to minimize variance. Extensive experiments demonstrate the benefits of our methods, which efficiently leverage the stratified sampling of off-policy data from multiple loggers.
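As a rough illustration of the setting (not the paper's efficient estimator itself), the sketch below pools a doubly robust OPE estimate over K stratified logs, using a plug-in $q$-function as the control variate and weighting each stratum by its sample share. All names here (logs, pi_e, q_hat) are hypothetical placeholders for the reader's own data and models.

```python
import numpy as np

def dr_ope_multiple_loggers(logs, pi_e, q_hat):
    """Minimal sketch: doubly robust OPE pooled over K stratified logs.

    logs  : list of dicts, one per logging policy k, with keys
            'x' (contexts, shape [n_k, d]), 'a' (logged actions, shape [n_k]),
            'r' (rewards, shape [n_k]), 'p' (logging propensities pi_k(a_i | x_i)).
    pi_e  : callable, pi_e(x) -> [n, n_actions] target-policy action probabilities.
    q_hat : callable, q_hat(x) -> [n, n_actions] estimated mean rewards (control variate).
    """
    n_total = sum(len(log['r']) for log in logs)
    value = 0.0
    for log in logs:
        x, a, r, p = log['x'], log['a'], log['r'], log['p']
        n_k = len(r)
        idx = np.arange(n_k)
        pi_e_all = pi_e(x)                      # pi_e(. | x_i) for every action
        q_all = q_hat(x)                        # q_hat(x_i, .) for every action
        w = pi_e_all[idx, a] / p                # importance weights pi_e / pi_k
        dm = np.sum(pi_e_all * q_all, axis=1)   # direct-method (control-variate) term
        dr_k = dm + w * (r - q_all[idx, a])     # per-sample DR scores in stratum k
        value += (n_k / n_total) * dr_k.mean()  # weight stratum by its sample share
    return value
```

This pooled form uses each logger's own propensities and a shared $q$-estimate; the paper's contribution is characterizing which choice of weights and control variate attains the efficiency bound under stratified sampling, which this sketch does not attempt to reproduce.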
