BubbleRank: Safe Online Learning to Rerank

We study the problem of online learning to re-rank, where users provide feedback to improve the quality of displayed lists. Learning to rank has traditionally been studied in two settings. In the offline setting, rankers are typically learned from relevance labels provided by human judges. These approaches have become the industry standard; however, they lack exploration and are therefore limited by the information content of offline data. In the online setting, an algorithm can propose a list and learn from the feedback on it in a sequential fashion. Bandit algorithms developed for this setting actively experiment, and in this way overcome the biases of offline data. But they also tend to ignore offline data, which results in a high initial cost of exploration. We propose BubbleRank, a bandit algorithm for re-ranking that combines the strengths of both settings. The algorithm starts with an initial base list and improves it gradually by swapping higher-ranked, less attractive items for lower-ranked, more attractive items. We prove an upper bound on the n-step regret of BubbleRank that degrades gracefully as the quality of the initial base list decreases. Our theoretical findings are supported by extensive numerical experiments on a large real-world click dataset.
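To make the mechanism described above concrete, the Python sketch below illustrates the high-level idea under simplified assumptions: maintain a base list, randomly exchange adjacent items before display, collect pairwise click statistics, and promote a lower-ranked item once the evidence suggests it is more attractive than its neighbour. The attraction probabilities, the position-based click simulator, and the confident promotion rule are hypothetical stand-ins for illustration, not the paper's exact algorithm, click model, or confidence bound.

```python
# Minimal sketch of safe online re-ranking via randomized adjacent exchanges.
# All numeric choices (attraction probabilities, examination decay, delta) are
# illustrative assumptions, not values from the paper.
import math
import random

random.seed(0)

n_items = 8
# Hypothetical per-item attraction probabilities (unknown to the learner).
attraction = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]

# A deliberately suboptimal initial base list (item ids in display order).
base_list = [2, 0, 4, 1, 6, 3, 7, 5]

# Pairwise click statistics: wins[i][j] counts rounds in which item i was
# clicked and item j was not while the two were shown in adjacent positions.
wins = [[0] * n_items for _ in range(n_items)]

def confident(i, j, delta=0.01):
    """Crude confidence test (an assumption): does item i clearly beat item j?"""
    s, t = wins[i][j], wins[j][i]
    n = s + t
    return n > 0 and (s - t) > math.sqrt(2 * n * math.log(1 / delta))

for _ in range(20000):
    # Randomized exchange: swap adjacent pairs starting at position 0 or 1.
    start = random.randint(0, 1)
    shown = base_list[:]
    for k in range(start, n_items - 1, 2):
        shown[k], shown[k + 1] = shown[k + 1], shown[k]

    # Simulated position-biased clicks: attraction times a decaying examination
    # probability (an assumed click model).
    clicks = [random.random() < attraction[item] * (0.9 ** pos)
              for pos, item in enumerate(shown)]

    # Update pairwise statistics for adjacent pairs with exactly one click.
    for k in range(n_items - 1):
        i, j = shown[k], shown[k + 1]
        if clicks[k] and not clicks[k + 1]:
            wins[i][j] += 1
        elif clicks[k + 1] and not clicks[k]:
            wins[j][i] += 1

    # Promote a lower-ranked item over its neighbour once the evidence is clear.
    for k in range(n_items - 1):
        upper, lower = base_list[k], base_list[k + 1]
        if confident(lower, upper):
            base_list[k], base_list[k + 1] = lower, upper

print("learned order:", base_list)  # tends toward [0, 1, 2, ..., 7]
```

The randomized adjacent exchange matters because each neighbouring pair is observed in both orders, so position bias largely cancels in the pairwise comparison, while the displayed list never deviates from the base list by more than one swap per position, which keeps exploration close to the initial ranking.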
