Doubly robust off-policy evaluation with shrinkage

We propose a new framework for designing estimators for off-policy evaluation in contextual bandits. Our approach is based on the asymptotically optimal doubly robust estimator, but we shrink the importance weights to minimize a bound on the mean squared error, which results in a better bias-variance tradeoff in finite samples. We use this optimization-based framework to obtain three estimators: (a) a weight-clipping estimator, (b) a new weight-shrinkage estimator, and (c) the first shrinkage-based estimator for combinatorial action sets. Extensive experiments in both standard and combinatorial bandit benchmark problems show that our estimators are highly adaptive and typically outperform state-of-the-art methods.
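As a rough illustration of the idea in the abstract, the sketch below computes a doubly robust value estimate in which the importance weights are clipped at a threshold before they multiply the reward-model residual. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the array layout, and the threshold parameter `lam` are all hypothetical, and the abstract does not give the exact form of the estimators or say how the threshold is chosen.

```python
import numpy as np


def clip_weights(w, lam):
    """Clip importance weights at lam (assumed form of the weight-clipping map)."""
    return np.minimum(w, lam)


def dr_with_clipping(pi_target, mu_logged, actions, rewards, q_hat, lam=np.inf):
    """Doubly robust off-policy value estimate with clipped importance weights.

    All argument names are illustrative assumptions:
      pi_target : (n, K) target-policy probabilities for each context and action
      mu_logged : (n,)   logging-policy probability of the action actually taken
      actions   : (n,)   integer index of the logged action
      rewards   : (n,)   observed reward for the logged action
      q_hat     : (n, K) reward-model predictions for each context and action
      lam       : clipping threshold; np.inf leaves the weights unchanged
    """
    idx = np.arange(len(actions))
    # Importance weight of the logged action under the target policy.
    w = pi_target[idx, actions] / mu_logged
    w_hat = clip_weights(w, lam)
    # Direct-method term: expected predicted reward under the target policy.
    direct = np.sum(pi_target * q_hat, axis=1)
    # Correction term: weighted residual of the reward model on the logged action.
    correction = w_hat * (rewards - q_hat[idx, actions])
    return float(np.mean(direct + correction))


# Tiny hand-made example with two contexts and two actions.
pi_target = np.array([[0.5, 0.5], [1.0, 0.0]])
mu_logged = np.array([0.5, 0.25])
actions = np.array([0, 0])
rewards = np.array([1.0, 0.0])
q_hat = np.array([[0.5, 0.5], [0.2, 0.4]])

for lam in (0.0, 2.0, np.inf):
    print(lam, dr_with_clipping(pi_target, mu_logged, actions, rewards, q_hat, lam))
```

With `lam=np.inf` the function reduces to the ordinary doubly robust estimator, and with `lam=0` it reduces to the direct (reward-model-only) estimate. Intermediate thresholds trade bias for variance, which is the tradeoff the abstract describes.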

Yi Su | Akshay Krishnamurthy | Miroslav Dudík | Maria Dimakopoulou
