Learning to Rank

Algorithms for learning to rank Web documents, display ads, or other types of items constitute a fundamental component of search engines and, more generally, of online services. In such systems, when a user makes a request or visits a web page, an ordered list of items (e.g., documents or ads) is displayed; the user scans this list in order and clicks on the first relevant item, if any. When the user clicks on an item, the reward collected by the system typically decreases with the position of the item in the displayed list. The main challenge in the design of sequential list selection algorithms stems from the fact that the probabilities with which the user clicks on the various items are unknown and need to be learned. We formulate the design of such algorithms as a stochastic bandit optimization problem. This problem differs from the classical bandit framework in two respects: (1) the feedback received by the system depends on the actual relevance of the various items in the displayed list (if the user clicks on the last item, we know that none of the previous items in the list is relevant); (2) there are inherent correlations between the average relevance of the items (e.g., the user may be interested in a specific topic only). We assume that items are categorized according to their topic and that users are clustered, so that users of the same cluster are interested in the same topic. We investigate several scenarios depending on the side information available on the user before selecting the displayed list: (a) we first treat the case where the topic the user is interested in is known when she places a request; (b) we then study the case where the user cluster is known but the mapping between user clusters and topics is unknown. For both scenarios, we derive regret lower bounds and devise algorithms that approach these fundamental limits.
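To make the click-feedback model concrete, the sketch below simulates the behavior described above: the user scans the displayed list in order, clicks on the first relevant item (if any), and the reward decays with the click position. It pairs this with a generic UCB-style index policy over items as a baseline; this is an illustrative assumption, not the algorithm proposed in the paper, and the relevance probabilities and position discount are made up for the example.

```python
import math
import random

def simulate_user(displayed, true_probs, rng):
    """Cascade click model: return the position (0-based) of the first
    relevant item the user clicks, or None if no item is clicked."""
    for pos, item in enumerate(displayed):
        if rng.random() < true_probs[item]:
            return pos
    return None

def ucb_list_selection(true_probs, list_len=3, horizon=10000, seed=0):
    """Illustrative UCB-style baseline under cascade feedback (assumption,
    not the paper's algorithm)."""
    rng = random.Random(seed)
    n_items = len(true_probs)
    clicks = [0] * n_items   # observed clicks per item
    views = [0] * n_items    # times an item was actually examined
    total_reward = 0.0

    for t in range(1, horizon + 1):
        def index(i):
            # Items never examined get an infinite index (forced exploration).
            if views[i] == 0:
                return float("inf")
            mean = clicks[i] / views[i]
            return mean + math.sqrt(2.0 * math.log(t) / views[i])

        # Display the list_len items with the highest indices.
        displayed = sorted(range(n_items), key=index, reverse=True)[:list_len]
        clicked_pos = simulate_user(displayed, true_probs, rng)

        # Cascade feedback: every item before the click was examined and found
        # irrelevant; the clicked item was examined and relevant; if no click
        # occurred, the whole list was examined without success.
        last_examined = clicked_pos if clicked_pos is not None else list_len - 1
        for pos in range(last_examined + 1):
            item = displayed[pos]
            views[item] += 1
            if pos == clicked_pos:
                clicks[item] += 1

        if clicked_pos is not None:
            total_reward += 1.0 / (clicked_pos + 1)  # reward decays with position

    return total_reward

if __name__ == "__main__":
    # Illustrative per-item relevance probabilities.
    print(ucb_list_selection([0.05, 0.3, 0.1, 0.5, 0.2, 0.05]))
```

Note how the feedback structure differs from a standard bandit: a single interaction can reveal the (ir)relevance of several items at once, which is exactly the property the abstract highlights in point (1).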
