Estimating interleaved comparison outcomes from historical click data

Interleaved comparison methods, which compare rankers using click data, are a promising alternative to traditional information retrieval evaluation methods that require expensive explicit judgments. A major limitation of these methods is that they assume access to live data, meaning that new data must be collected for every pair of rankers compared. We investigate the use of previously collected click data (i.e., historical data) for interleaved comparisons. We start by analyzing to what degree existing interleaved comparison methods can be applied to historical data and find that a recent probabilistic method allows such data reuse, even though its outcome estimates are biased in this setting. We then propose an interleaved comparison method that is based on the probabilistic approach but uses importance sampling to compensate for this bias. We experimentally confirm that probabilistic methods make the use of historical data for interleaved comparisons possible and effective.
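
The importance-sampling correction can be sketched as follows: each logged comparison outcome is reweighted by the ratio of the probability that the observed interleaved list would be produced by the comparison we now want to evaluate to its probability under the distribution that actually generated the historical log. The Python sketch below illustrates this reweighting in a minimal form; the function names, the data layout, and the assumption that both probabilities are available and nonzero are illustrative and do not reflect the authors' implementation.

```python
def estimate_comparison_outcome(logged_data, target_prob, logging_prob):
    """Importance-sampling estimate of the expected interleaved comparison
    outcome under a target interleaving distribution, computed from clicks
    collected under a different (logging) distribution.

    logged_data  -- list of (interleaved_list, outcome) pairs, where outcome
                    is +1 if ranker A won the comparison, -1 if ranker B won,
                    and 0 for a tie
    target_prob  -- probability of an interleaved list under the target
                    (new) interleaving distribution
    logging_prob -- probability of that list under the distribution that
                    generated the historical data

    Names and data layout here are illustrative assumptions, not the
    authors' implementation.
    """
    total = 0.0
    for interleaved_list, outcome in logged_data:
        # Reweight each logged outcome by how much more (or less) likely the
        # observed interleaved list is under the target distribution than
        # under the logging distribution, compensating for the mismatch
        # between the two.
        weight = target_prob(interleaved_list) / logging_prob(interleaved_list)
        total += weight * outcome
    return total / len(logged_data)
```

A probabilistic interleaving method is what makes such a sketch workable in the first place, since it assigns computable, nonzero probabilities to the logged interleaved lists; deterministic interleaving schemes generally do not.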
