Offline Evaluation and Optimization for Interactive Systems

Evaluating and optimizing an interactive system (such as a search engine, recommender, or advertising system) from historical data against a predefined online metric is challenging, especially when that metric is computed from user feedback such as clicks and payments. The key challenge is counterfactual in nature: we only observe a user's feedback for the actions the system actually took, and we do not know how that user would have reacted to a different action. The gold standard for evaluating such metrics of a user-interacting system is the online A/B experiment (a.k.a. randomized controlled experiment), which can be expensive in both time and engineering resources. Offline evaluation/optimization (sometimes referred to as off-policy learning in the literature) thus becomes critical: it aims to evaluate the same metrics without running (many) expensive A/B experiments on live users. One approach to offline evaluation is to build a user model that simulates user behavior (clicks, purchases, etc.) under various contexts, and then evaluate a system's metrics against this simulator. While straightforward and common in practice, such model-based approaches are only as reliable as the user model they are built on, and it is often difficult to know a priori whether a user model is good enough to be trusted. Recent years have seen growing interest in another solution to the offline evaluation problem. Using statistical techniques such as importance sampling and doubly robust estimation, this approach can give unbiased estimates of metrics for a wide range of problems. It enjoys other benefits as well: it often allows data scientists to obtain a confidence interval for the estimate to quantify the amount of uncertainty, and it does not require building user models, so it is more robust and easier to apply. All of these benefits make the approach particularly attractive, and successful applications have been reported in the last few years by several industry leaders. This tutorial reviews the basic theory and representative techniques. Applications of these techniques are illustrated through several case studies done at Microsoft and Yahoo!.
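To make the importance-sampling idea concrete, below is a minimal sketch of an inverse propensity scoring (IPS) estimator with a normal-approximation confidence interval. It is not the tutorial's own code; the data layout (logged tuples of context, action, reward, and logging propensity) and the function names are illustrative assumptions.

```python
import numpy as np

def ips_estimate(logged_data, target_policy):
    """IPS estimate of a target policy's expected reward from logged feedback.

    logged_data: iterable of (context, action, reward, propensity) tuples,
        where `propensity` is the logging policy's probability of the
        logged action given the context (assumed known and positive).
    target_policy: function mapping (context, action) to the probability
        that the new policy would choose that action in that context.
    """
    # Reweight each logged reward by how much more (or less) likely the
    # target policy is to take the logged action than the logging policy was.
    weighted_rewards = np.array([
        reward * target_policy(context, action) / propensity
        for context, action, reward, propensity in logged_data
    ])
    estimate = weighted_rewards.mean()
    # A rough 95% confidence interval quantifies the estimator's uncertainty.
    stderr = weighted_rewards.std(ddof=1) / np.sqrt(len(weighted_rewards))
    return estimate, (estimate - 1.96 * stderr, estimate + 1.96 * stderr)
```

A doubly robust estimator follows the same pattern but adds a reward model's prediction as a baseline, correcting it with the importance-weighted residual so the estimate stays unbiased whenever either the propensities or the reward model is accurate.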
