Information Retrieval manuscript No. (will be inserted by the editor)

Balancing Exploration and Exploitation in Listwise and Pairwise Online Learning to Rank for Information Retrieval

As retrieval systems become more complex, learning to rank approaches are being developed to automatically tune their parameters. Using online learning to rank, retrieval systems can learn directly from implicit feedback inferred from user interactions. In such an online setting, algorithms must obtain feedback for effective learning while simultaneously utilizing what has already been learned to produce high quality results. We formulate this challenge as an exploration–exploitation dilemma and propose two methods for addressing it. By adding mechanisms for balancing exploration and exploitation during learning, each method extends a state-of-the-art learning to rank method, one based on listwise learning and the other on pairwise learning. Using a recently developed simulation framework that allows assessment of online performance, we empirically evaluate both methods. Our results show that balancing exploration and exploitation can substantially and significantly improve the online retrieval performance of both listwise and pairwise approaches. In addition, the results demonstrate that such a balance affects the two approaches in different ways, especially when user feedback is noisy, yielding new insights relevant to making online learning to rank effective in practice.
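The core idea described above, injecting exploratory results into an otherwise exploitative ranking so the system can keep learning from clicks, can be sketched as follows. This is an illustrative simplification, not the paper's exact algorithm: `blend_rankings`, its `epsilon` exploration rate, and the assumption that both rankings cover the same document set are all choices made for this sketch.

```python
import random

def blend_rankings(exploit, explore, epsilon=0.5, seed=None):
    """Construct a single result list from two rankings of the same documents.

    At each rank, the next document is drawn from the exploratory ranking
    with probability `epsilon` and from the exploitative ranking otherwise.
    epsilon=0.0 is pure exploitation; epsilon=1.0 is pure exploration.
    Illustrative sketch only; the paper's methods balance exploration and
    exploitation inside listwise and pairwise learners, not via this rule.
    """
    rng = random.Random(seed)
    result, seen = [], set()
    while len(result) < len(exploit):
        source = explore if rng.random() < epsilon else exploit
        placed = False
        for doc in source:          # take the top not-yet-shown document
            if doc not in seen:
                result.append(doc)
                seen.add(doc)
                placed = True
                break
        if not placed:              # chosen list exhausted; use the other
            other = exploit if source is explore else explore
            for doc in other:
                if doc not in seen:
                    result.append(doc)
                    seen.add(doc)
                    break
    return result
```

Clicks on documents contributed by the exploratory ranking then provide the feedback needed to update the learner, while documents from the exploitative ranking keep the list's quality high for the current user; tuning `epsilon` trades off these two goals.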
