Doubly Robust Policy Evaluation and Optimization

We study sequential decision making in environments where rewards are only partially observed, but can be modeled as a function of the observed context and the action chosen by the decision maker. This setting, known as contextual bandits, encompasses a wide variety of applications such as health care, content recommendation and Internet advertising. A central task is the evaluation of a new policy given historical data consisting of contexts, actions and received rewards. The key challenge is that the past data typically do not faithfully represent the proportions of actions that the new policy would take. Previous approaches rely either on models of rewards or on models of the past policy. The former are plagued by large bias, whereas the latter suffer from large variance. In this work, we leverage the strengths and overcome the weaknesses of the two approaches by applying the doubly robust estimation technique to the problems of policy evaluation and optimization. We prove that this approach yields accurate value estimates when we have either a good (but not necessarily consistent) model of rewards or a good (but not necessarily consistent) model of the past policy. Extensive empirical comparison demonstrates that doubly robust estimation uniformly improves over existing techniques, achieving both lower variance in value estimation and better policies. As such, we expect the doubly robust approach to become common practice in policy evaluation and optimization.
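To make the idea concrete, the following is a minimal sketch of a doubly robust value estimate for a deterministic target policy, written under stated assumptions rather than taken from the paper. It uses the standard form: the reward model's prediction for the target action, plus an importance-weighted correction applied only when the logged action matches the target action. The names `doubly_robust_value`, `policy`, `reward_model`, and `propensity_model`, as well as the toy data, are hypothetical and chosen only for illustration.

```python
from typing import Callable, Sequence, Tuple

# One logged interaction: (context, logged action, observed reward).
# A float context and an int action are illustrative assumptions.
Logged = Tuple[float, int, float]


def doubly_robust_value(
    data: Sequence[Logged],
    policy: Callable[[float], int],
    reward_model: Callable[[float, int], float],
    propensity_model: Callable[[float, int], float],
) -> float:
    """Estimate the value of a deterministic `policy` from logged data.

    Per sample: r_hat(x, pi(x)) + 1[a == pi(x)] * (r - r_hat(x, a)) / p_hat(a | x)
    """
    total = 0.0
    for x, a, r in data:
        target = policy(x)
        # Model-based term: predicted reward of the action the new policy picks.
        estimate = reward_model(x, target)
        # Correction term: only samples where the logged action agrees with
        # the new policy contribute, weighted by the inverse propensity.
        if a == target:
            estimate += (r - reward_model(x, a)) / propensity_model(x, a)
        total += estimate
    return total / len(data)


# Toy data: two actions, logged by a uniform policy (propensity 0.5 each).
logs = [(0.0, 1, 1.0), (0.0, 0, 0.0), (0.0, 1, 0.0), (0.0, 0, 1.0)]
always_one = lambda x: 1
uniform = lambda x, a: 0.5

# With a reward model that always predicts 0, the estimate reduces to
# inverse propensity scoring.
value_zero_model = doubly_robust_value(logs, always_one, lambda x, a: 0.0, uniform)

# With a reward model that predicts 0.5 for every context and action.
value_half_model = doubly_robust_value(logs, always_one, lambda x, a: 0.5, uniform)
```

On this toy data the propensities are correct, so both reward models lead to the same estimate; the reward model affects only how much each per-sample term fluctuates around it.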
