Large-scale Validation of Counterfactual Learning Methods: A Test-Bed

The ability to perform effective off-policy learning would revolutionize the process of building better interactive systems, such as search engines and recommendation systems for e-commerce, computational advertising, and news. Recent approaches for off-policy evaluation and learning in these settings appear promising. With this paper, we provide real-world data and a standardized test-bed for systematically investigating these algorithms using data from display advertising. In particular, we consider the problem of filling a banner ad with an aggregate of multiple products the user may want to purchase. This paper presents our test-bed, describes the sanity checks we ran to ensure its validity, and shows results comparing state-of-the-art off-policy learning methods, such as doubly robust optimization and POEM, with reductions to supervised learning that use regression baselines. Our results provide experimental evidence that recent off-policy learning methods can improve upon state-of-the-art supervised learning techniques on a large-scale real-world data set.
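To make the estimators named above concrete, the following is a minimal sketch of the three standard off-policy value estimators that underpin this line of work: inverse propensity scoring (IPS), its self-normalized variant (SNIPS), and the doubly robust (DR) estimator. This is an illustration under assumed inputs, not the paper's actual code: the function names, the synthetic logged data, and the constant reward model are all hypothetical.

```python
import numpy as np

# Each logged record is assumed to carry: the reward r observed for the
# logged action, the logging policy's propensity p = pi_0(a|x) for that
# action, and the target policy's probability q = pi(a|x) for the same
# action. All arrays below are aligned per logged impression.

def ips(rewards, target_probs, propensities):
    """Inverse propensity scoring: unbiased, but can have high variance."""
    w = target_probs / propensities  # importance weights
    return np.mean(w * rewards)

def snips(rewards, target_probs, propensities):
    """Self-normalized IPS: slightly biased but consistent, lower variance."""
    w = target_probs / propensities
    return np.sum(w * rewards) / np.sum(w)

def doubly_robust(rewards, target_probs, propensities, reward_model):
    """Doubly robust: a regression estimate plus an IPS correction term.

    `reward_model` is a (r_hat_logged, r_hat_target) pair: the model's
    reward predictions for the logged actions, and the model's expected
    reward under the target policy, per impression.
    """
    r_hat_logged, r_hat_target = reward_model
    w = target_probs / propensities
    return np.mean(r_hat_target + w * (rewards - r_hat_logged))

# Tiny synthetic example (all values hypothetical).
rng = np.random.default_rng(0)
n = 10_000
propensities = rng.uniform(0.1, 0.9, size=n)   # pi_0(a|x) for logged a
target_probs = rng.uniform(0.1, 0.9, size=n)   # pi(a|x) for logged a
rewards = rng.binomial(1, 0.3, size=n).astype(float)
r_hat = np.full(n, 0.3)                        # trivial constant reward model

print("IPS:  ", ips(rewards, target_probs, propensities))
print("SNIPS:", snips(rewards, target_probs, propensities))
print("DR:   ", doubly_robust(rewards, target_probs, propensities, (r_hat, r_hat)))
```

The self-normalized estimator trades a small bias for a substantial variance reduction, which is one reason it is often preferred when learning from logged bandit feedback; the doubly robust estimator remains unbiased if either the propensities or the reward model are correct.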
