Gathering Additional Feedback on Search Results by Multi-Armed Bandits with Respect to Production Ranking

Given a repeatedly issued query and a document whose potential to satisfy users' needs is not yet confirmed, a search system should place this document at a high position in order to gather user feedback and obtain a more confident estimate of the document's utility. On the other hand, the main objective of the search system is to maximize expected user satisfaction over a long period, which requires showing more relevant documents on average. State-of-the-art approaches to this exploration-exploitation dilemma rely on strongly simplified settings that make them infeasible in practice. We extend the most flexible and pragmatic of them to handle two practical issues: utilizing prior information about queries and documents, and combining bandit-based learning with a default production ranking algorithm. We show experimentally that our framework significantly improves the ranking of a leading commercial search engine.
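To make the exploration-exploitation tradeoff and the use of production-ranker priors concrete, here is a minimal Thompson-sampling sketch in Python. It illustrates the general bandit idea, not the authors' specific algorithm: each document's click probability gets a Beta prior seeded from a production relevance score assumed to lie in [0, 1], so exploration starts near the existing ranking rather than from scratch. The names `prior_strength` and the simulated click rates are hypothetical, introduced only for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

class BetaBanditRanker:
    """Thompson-sampling ranker with one Beta(CTR) arm per document.

    Priors are seeded from a production relevance score so exploration
    starts near the existing ranking. `prior_strength` (a hypothetical
    tuning knob) says how many pseudo-observations that score is worth.
    """

    def __init__(self, prod_scores, prior_strength=10.0):
        scores = np.asarray(prod_scores, dtype=float)
        # Encode prior_strength pseudo-trials whose success rate
        # equals the production score (assumed to be in [0, 1]).
        self.alpha = 1.0 + prior_strength * scores
        self.beta = 1.0 + prior_strength * (1.0 - scores)

    def rank(self):
        # Sample a plausible CTR for every document and sort by it:
        # uncertain documents sometimes sample high and get explored,
        # while well-estimated relevant documents dominate on average.
        samples = rng.beta(self.alpha, self.beta)
        return np.argsort(-samples)

    def update(self, doc, clicked):
        # Binary click feedback updates the Beta posterior.
        if clicked:
            self.alpha[doc] += 1.0
        else:
            self.beta[doc] += 1.0


# Toy usage: the third document has an uncertain but actually high
# utility; repeated click feedback gradually promotes it.
ranker = BetaBanditRanker(prod_scores=[0.6, 0.4, 0.5])
true_ctr = [0.55, 0.35, 0.70]  # simulated user behavior (assumption)
for _ in range(200):
    top = ranker.rank()[0]
    ranker.update(top, clicked=rng.random() < true_ctr[top])
print(ranker.rank())  # typically places document 2 first by now
```

A common design choice this sketch reflects: the production ranker supplies the prior mean, while `prior_strength` controls how quickly live feedback can override it, which is the pragmatic compromise between trusting the default ranking and exploring unconfirmed documents.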
