Learning to Rank in the Position Based Model with Bandit Feedback

Personalization is a crucial aspect of many online experiences, and content ranking is often a key component in delivering sophisticated personalization results. Commonly, supervised learning-to-rank methods are applied, but these suffer from biases introduced during data collection by the production system that produces the rankings. To compensate for this problem, we leverage contextual multi-armed bandits. We propose novel extensions of two well-known algorithms, LinUCB and Linear Thompson Sampling, to the ranking use case. To account for the biases present in a production environment, we employ the position-based click model. Finally, we demonstrate the validity of the proposed algorithms through extensive offline experiments on synthetic datasets as well as customer-facing online A/B experiments.
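
For concreteness, the position-based click model (PBM) assumes the probability of a click on item i displayed at rank k factorizes as P(click) = kappa_k * theta_i, where kappa_k is the examination probability of position k and theta_i is the attraction of the item. Under a linear attraction model theta_i = x_i^T w, the observed click has expectation kappa_k * x_i^T w, so regressing clicks on the kappa-scaled features yields an unbiased estimate of w. The sketch below illustrates how LinUCB could be adapted along these lines; the class name, the assumption that the kappa_k are known, and this particular bias-correction scheme are illustrative assumptions, not the paper's exact construction.

    import numpy as np

    # A minimal sketch: LinUCB adapted to the position-based click model (PBM).
    # Assumptions (not taken from the paper): the examination probabilities
    # kappa are known and decreasing in position, attraction is linear in the
    # item features, and clicks are Bernoulli.

    class PBMLinUCB:
        def __init__(self, dim, kappa, alpha=1.0, reg=1.0):
            self.kappa = np.asarray(kappa)   # examination prob. per position
            self.alpha = alpha               # exploration strength
            self.A = reg * np.eye(dim)       # ridge-regularized Gram matrix
            self.b = np.zeros(dim)           # accumulated click-weighted features

        def rank(self, X):
            """Return item indices for each position, best first.

            X: (n_items, dim) feature matrix of the candidate items.
            """
            A_inv = np.linalg.inv(self.A)
            w_hat = A_inv @ self.b
            # Optimistic (upper-confidence) attraction estimate per item.
            ucb = X @ w_hat + self.alpha * np.sqrt(
                np.einsum('ij,jk,ik->i', X, A_inv, X))
            # Since kappa is decreasing in position, placing higher-UCB items
            # earlier maximizes expected clicks under the PBM.
            order = np.argsort(-ucb)
            return order[: len(self.kappa)]

        def update(self, X, ranking, clicks):
            """Update with feedback; clicks[k] is the 0/1 click at position k."""
            for k, item in enumerate(ranking):
                # E[click] = kappa_k * x^T w, so regressing clicks on
                # kappa_k * x gives an unbiased estimate of w.
                z = self.kappa[k] * X[item]
                self.A += np.outer(z, z)
                self.b += clicks[k] * z

For example, PBMLinUCB(dim=5, kappa=[1.0, 0.6, 0.3]) would rank three slots from five-dimensional item features. A Linear Thompson Sampling analogue would replace the UCB bonus by sampling w from N(w_hat, alpha^2 * A^{-1}) and ranking by the sampled scores; the covariance scaling here is likewise an assumption for illustration.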
