Triply Robust Off-Policy Evaluation

We propose a robust regression approach to off-policy evaluation (OPE) for contextual bandits. We frame OPE as a covariate-shift problem and leverage modern robust regression tools. The approach is general and can augment any existing OPE method that uses the direct method; when it augments a doubly robust method, we call the result Triply Robust. We prove upper bounds on the bias and variance of the resulting estimator, and derive novel minimax bounds via a robust minimax analysis for covariate shift. Our robust regression method is compatible with deep learning and thus applies to complex OPE settings that require powerful function approximators. Finally, we demonstrate superior empirical performance on standard OPE benchmarks, especially when the logging policy is unknown and must be estimated from data.
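
For reference, here is a minimal sketch of the standard contextual-bandit OPE estimators that the abstract builds on, stated in notation of our own choosing (the source does not fix notation here): given logged tuples $(x_i, a_i, r_i)$, $i = 1, \dots, n$, collected under a logging policy $\mu$, a target policy $\pi$, and a reward model $\hat{r}$, the direct method (DM), inverse propensity scoring (IPS), and doubly robust (DR) estimators of the value of $\pi$ are

\[
\hat{V}_{\mathrm{DM}} = \frac{1}{n}\sum_{i=1}^{n} \sum_{a} \pi(a \mid x_i)\, \hat{r}(x_i, a),
\qquad
\hat{V}_{\mathrm{IPS}} = \frac{1}{n}\sum_{i=1}^{n} \frac{\pi(a_i \mid x_i)}{\mu(a_i \mid x_i)}\, r_i,
\]
\[
\hat{V}_{\mathrm{DR}} = \frac{1}{n}\sum_{i=1}^{n} \left[ \sum_{a} \pi(a \mid x_i)\, \hat{r}(x_i, a) \;+\; \frac{\pi(a_i \mid x_i)}{\mu(a_i \mid x_i)} \bigl( r_i - \hat{r}(x_i, a_i) \bigr) \right].
\]

In this framing, the proposed approach fits the reward model $\hat{r}$ by robust regression under covariate shift (from the logging distribution over context-action pairs to the target policy's distribution); plugging such a $\hat{r}$ into DM augments the direct method, and plugging it into DR yields what the abstract calls Triply Robust. The precise robust-regression objective and the accompanying bias, variance, and minimax guarantees are the paper's contributions and are not reproduced in this sketch.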
