BubbleRank: Safe Online Learning to Re-Rank via Implicit Click Feedback

In this paper, we study the problem of safe online learning to re-rank, where user feedback is used to improve the quality of displayed lists. Learning to rank has traditionally been studied in two settings. In the offline setting, rankers are typically learned from relevance labels created by judges. This approach has become standard in industrial applications of ranking, such as search, but it lacks exploration and is therefore limited by the information content of the offline training data. In the online setting, an algorithm experiments with lists and learns from feedback on them in a sequential fashion. Bandit algorithms are well suited to this setting, but they tend to learn user preferences from scratch, which incurs a high initial cost of exploration and raises the additional challenge of safe exploration in ranked lists. We propose BubbleRank, a bandit algorithm for safe re-ranking that combines the strengths of the offline and online settings. The algorithm starts with an initial base list and improves it online by gradually exchanging higher-ranked, less attractive items for lower-ranked, more attractive items. We prove an upper bound on the n-step regret of BubbleRank that degrades gracefully with the quality of the initial base list. Our theoretical findings are supported by extensive experiments on a large-scale real-world click dataset.
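
To make the exchange idea concrete, below is a minimal sketch of bubble-sort-style re-ranking from click feedback. It is an illustration under stated assumptions, not the authors' exact algorithm: the randomized adjacent-pair perturbation, the click-duel counters, and the confidence threshold controlled by the hypothetical parameter `delta` are all simplifying choices; the paper's actual statistical test and regret analysis are not reproduced here.

```python
import math
import random

def bubblerank_sketch(base_list, get_clicks, n_steps, delta=0.05):
    """Illustrative bubble-sort-style online re-ranking (not the paper's exact method).

    base_list  : initial ranked list of item ids (e.g. from an offline ranker)
    get_clicks : callable taking a displayed list and returning the set of clicked items
    delta      : failure probability used in the assumed confidence test
    """
    ranking = list(base_list)
    # For each ordered item pair, count how often the first item was clicked
    # while its displayed neighbour was not (a "click duel" win).
    wins = {}

    for _ in range(n_steps):
        # Perturb the current list by randomly swapping non-overlapping
        # adjacent pairs, starting from a random even or odd offset.
        displayed = list(ranking)
        offset = random.randint(0, 1)
        adjacent_pairs = []
        for i in range(offset, len(displayed) - 1, 2):
            if random.random() < 0.5:
                displayed[i], displayed[i + 1] = displayed[i + 1], displayed[i]
            adjacent_pairs.append((i, i + 1))

        clicks = get_clicks(displayed)

        # Record click duels for the adjacent pairs shown together.
        for i, j in adjacent_pairs:
            a, b = displayed[i], displayed[j]
            if a in clicks and b not in clicks:
                wins[(a, b)] = wins.get((a, b), 0) + 1
            elif b in clicks and a not in clicks:
                wins[(b, a)] = wins.get((b, a), 0) + 1

        # Promote a lower-ranked item above its neighbour once the evidence
        # that it is more attractive clears a simple confidence threshold.
        for i in range(len(ranking) - 1):
            hi, lo = ranking[i], ranking[i + 1]
            n_lo = wins.get((lo, hi), 0)
            n_hi = wins.get((hi, lo), 0)
            n = n_lo + n_hi
            if n > 0 and n_lo - n_hi > math.sqrt(2 * n * math.log(1 / delta)):
                ranking[i], ranking[i + 1] = lo, hi

    return ranking
```

The sketch only ever exchanges neighbouring items, which reflects the safety intuition in the abstract: each displayed list stays close to the base list, so exploration never degrades the ranking far below the quality of the initial offline ranker.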
