CAB: Continuous Adaptive Blending for Policy Evaluation and Learning

The ability to perform offline A/B-testing and off-policy learning using logged contextual bandit feedback is highly desirable in a broad range of applications, including recommender systems, search engines, ad placement, and personalized health care. Both offline A/B-testing and off-policy learning require a counterfactual estimator that evaluates how some new policy would have performed if it had been used instead of the logging policy. In this paper, we present and analyze a family of counterfactual estimators that subsumes most estimators proposed to date. Most importantly, this analysis identifies a new estimator, called Continuous Adaptive Blending (CAB), which enjoys many advantageous theoretical and practical properties. In particular, it can be substantially less biased than clipped Inverse Propensity Score (IPS) weighting and the Direct Method, and it can have lower variance than the Doubly Robust and IPS estimators. In addition, it is subdifferentiable, so it can be used for learning, unlike the SWITCH estimator. Experimental results show that CAB provides excellent evaluation accuracy and outperforms other counterfactual estimators in terms of learning performance.
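
As a rough illustration of the blending idea, the sketch below estimates a target policy's expected reward by combining, per logged sample, a clipped IPS term with a complementarily weighted Direct Method (reward model) term. This is a minimal sketch under assumptions: the function name, variable names, and the blending rule alpha_i = min(1, M / w_i) are illustrative choices, not the paper's notation. The key property it demonstrates is that the blend varies continuously with the importance weight w_i, rather than switching abruptly between estimators as SWITCH does.

import numpy as np

def cab_style_estimate(w, rewards, dm_estimates, M):
    # w            : importance weights pi(a_i | x_i) / pi_0(a_i | x_i)
    # rewards      : rewards observed under the logging policy pi_0
    # dm_estimates : reward-model predictions for the target policy at x_i
    # M            : clipping threshold trading off bias against variance
    w = np.asarray(w, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    dm_estimates = np.asarray(dm_estimates, dtype=float)
    # Per-sample blending weight (assumed form): pure IPS while w_i <= M,
    # with mass shifting continuously to the reward model as w_i grows.
    alpha = np.minimum(1.0, M / w)
    # Note that alpha * w equals min(w, M), i.e. the clipped importance weight.
    return float(np.mean(alpha * w * rewards + (1.0 - alpha) * dm_estimates))

# Example: three logged samples with importance weights 0.5, 2, and 10.
print(cab_style_estimate(w=[0.5, 2.0, 10.0],
                         rewards=[1.0, 0.0, 1.0],
                         dm_estimates=[0.8, 0.3, 0.6],
                         M=5.0))

Under this sketch, M -> infinity recovers plain IPS and M = 0 recovers the Direct Method, which mirrors the bias/variance trade-off described in the abstract; because min is subdifferentiable, the estimate also remains usable as a learning objective.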
