Off-policy evaluation for slate recommendation

This paper studies the evaluation of policies that recommend an ordered set of items (e.g., a ranking) based on some context---a common scenario in web search, ads, and recommendation. We build on techniques from combinatorial bandits to introduce a new practical estimator that uses logged data to estimate a policy's performance. A thorough empirical evaluation on real-world data reveals that our estimator is accurate in a variety of settings, including as a subroutine in a learning-to-rank task, where it achieves competitive performance. We derive conditions under which our estimator is unbiased---these conditions are weaker than prior heuristics for slate evaluation---and experimentally demonstrate a smaller bias than parametric approaches, even when these conditions are violated. Finally, our theory and experiments also show exponential savings in the amount of required data compared with general unbiased estimators.
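
To make the setup concrete, below is a minimal, hypothetical Python sketch of a pseudoinverse-style slate estimator in the spirit of the combinatorial-bandit techniques the abstract builds on. It is illustrative rather than the paper's exact method: the function names (`indicator`, `pi_estimate`), the single fixed context, the uniform logging policy, and the additively decomposable reward are all assumptions made for this example.

```python
import itertools
import numpy as np

def indicator(slate, m):
    """Encode a slate (tuple of action indices, one per slot) as a 0/1
    vector of length l*m with a single 1 per slot marking its action."""
    v = np.zeros(len(slate) * m)
    for slot, action in enumerate(slate):
        v[slot * m + action] = 1.0
    return v

def pi_estimate(logged, mu, target, m):
    """Pseudoinverse-style estimate of the target policy's expected reward.

    logged : list of (slate, reward) pairs collected under the logging policy
    mu     : dict mapping each slate (tuple) to its logging probability
    target : dict mapping each slate (tuple) to its target probability
    """
    # Second moment of slate indicators under the logging policy,
    # and mean indicator under the target policy.
    gamma = sum(p * np.outer(indicator(s, m), indicator(s, m))
                for s, p in mu.items())
    theta = sum(p * indicator(s, m) for s, p in target.items())
    # Per-slate importance weight via the Moore-Penrose pseudoinverse.
    weights = theta @ np.linalg.pinv(gamma)
    return np.mean([r * (weights @ indicator(s, m)) for s, r in logged])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    m, l = 4, 2  # 4 actions, 2 slots, slates are ordered pairs w/o repeats
    slates = list(itertools.permutations(range(m), l))
    mu = {s: 1.0 / len(slates) for s in slates}              # uniform logging
    target = {s: (1.0 if s == (0, 1) else 0.0) for s in slates}

    # Simulate logs: reward decomposes additively over slots, the kind of
    # linearity assumption under which such estimators can be unbiased.
    phi = rng.uniform(size=(l, m))  # hypothetical per-slot action quality
    def reward(s):
        return sum(phi[j, a] for j, a in enumerate(s)) + rng.normal(scale=0.1)

    keys, probs = list(mu), list(mu.values())
    logged = [(keys[rng.choice(len(keys), p=probs)], 0.0) for _ in range(5000)]
    logged = [(s, reward(s)) for s, _ in logged]

    print("PI estimate:", pi_estimate(logged, mu, target, m))
    print("True value :", sum(phi[j, a] for j, a in enumerate((0, 1))))
```

With enough logged samples the estimate concentrates near the true value. The point this illustrates is the source of the claimed exponential savings: only the (l*m)-dimensional moment matrix of slot-level indicators needs to be handled, rather than propensities over the exponentially many whole slates that a general unbiased importance-weighting estimator would require.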
