Doubly robust off-policy evaluation with shrinkage

We propose a new framework for designing estimators for off-policy evaluation in contextual bandits. Our approach is based on the asymptotically optimal doubly robust estimator, but we shrink the importance weights to minimize a bound on the mean squared error, which results in a better bias-variance tradeoff in finite samples. We use this optimization-based framework to obtain three estimators: (a) a weight-clipping estimator, (b) a new weight-shrinkage estimator, and (c) the first shrinkage-based estimator for combinatorial action sets. Extensive experiments in both standard and combinatorial bandit benchmark problems show that our estimators are highly adaptive and typically outperform state-of-the-art methods.
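
To make the role of the importance weights concrete, the following is a minimal sketch of a doubly robust value estimate whose weights are passed through a clipping map before use. It is an illustration under stated assumptions, not the authors' implementation: the function names (`clip_weights`, `dr_estimate`), the array layout, and the threshold `lam` are all hypothetical, and the clipping rule `min(w, lam)` is the standard form of weight clipping rather than a formula given in the abstract. The reward model `q_hat` is assumed to be supplied externally.

```python
import numpy as np


def clip_weights(w, lam):
    """Standard weight clipping: cap each importance weight at lam (assumed form)."""
    return np.minimum(w, lam)


def dr_estimate(pi_probs, mu_logged, actions, rewards, q_hat, weight_map=None):
    """Doubly robust value estimate with optionally modified importance weights.

    pi_probs  : (n, K) target-policy probabilities pi(a | x_i)
    mu_logged : (n,)   logging-policy probability of the logged action
    actions   : (n,)   integer index of the logged action
    rewards   : (n,)   observed rewards
    q_hat     : (n, K) reward-model predictions for every action
    weight_map: optional function applied to the raw importance weights
    """
    idx = np.arange(len(actions))
    # Raw importance weights w_i = pi(a_i | x_i) / mu(a_i | x_i).
    w = pi_probs[idx, actions] / mu_logged
    if weight_map is not None:
        w = weight_map(w)
    # Direct-method term: expected model reward under the target policy.
    direct = np.sum(pi_probs * q_hat, axis=1)
    # Correction term: weighted residual of the model on the logged action.
    correction = w * (rewards - q_hat[idx, actions])
    return float(np.mean(direct + correction))


# Tiny hand-made example (all numbers are illustrative, not from the paper).
pi_probs = np.array([[0.5, 0.5], [1.0, 0.0]])
mu_logged = np.array([0.25, 0.5])
actions = np.array([0, 0])
rewards = np.array([1.0, 0.0])
q_hat = np.array([[0.5, 0.5], [0.2, 0.1]])

v_dr = dr_estimate(pi_probs, mu_logged, actions, rewards, q_hat)
v_clipped = dr_estimate(pi_probs, mu_logged, actions, rewards, q_hat,
                        weight_map=lambda w: clip_weights(w, 1.0))
v_dm = dr_estimate(pi_probs, mu_logged, actions, rewards, q_hat,
                   weight_map=lambda w: clip_weights(w, 0.0))
```

In this sketch a very large `lam` leaves the weights untouched and recovers the ordinary doubly robust estimate, while `lam = 0` removes the correction term and leaves only the reward-model (direct method) estimate. Intermediate values trade the variance of large weights against the bias of the reward model, which is the tradeoff the abstract refers to.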
