Fidelity, Soundness, and Efficiency of Interleaved Comparison Methods

Ranker evaluation is central to research on search engines, whether the goal is to compare rankers or to provide feedback for learning to rank. Traditional evaluation approaches do not scale well because they require explicit relevance judgments of query-document pairs, which are expensive to obtain. A promising alternative is the use of interleaved comparison methods, which compare rankers using click data obtained when interleaving their rankings. In this article, we propose a framework for analyzing interleaved comparison methods. An interleaved comparison method has fidelity if the expected outcome of its ranker comparisons properly corresponds to the true relevance of the ranked documents. It is sound if its estimates of that expected outcome are unbiased and consistent. It is efficient if those estimates are accurate with only a small amount of data. We analyze existing interleaved comparison methods and find that, while sound, none meets our criteria for fidelity. We propose probabilistic interleave, a method that is sound and has fidelity. We show empirically that, by marginalizing out variables that are known, it is more efficient than existing interleaved comparison methods. Using importance sampling, we derive a sound extension that can reuse historical data collected in previous comparisons of other ranker pairs.
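
To make the abstract's key ideas concrete, the sketch below illustrates, in Python, one way the probabilistic-interleave idea and the importance-sampling reuse of historical data could look. It is a minimal illustration under assumptions of our own, not the article's reference implementation: the rank-based softmax with exponent tau, the click-credit rule in credit_clicks, and the (outcome, logging_prob, interleaved_list) log format are hypothetical choices made for this example.

```python
import random


def softmax_over_ranks(ranking, tau=3.0):
    """Soften a ranked list into a distribution: P(d) proportional to 1 / rank^tau.
    The exponent tau = 3.0 is an illustrative choice."""
    weights = {doc: 1.0 / (rank + 1) ** tau for rank, doc in enumerate(ranking)}
    total = sum(weights.values())
    return {doc: w / total for doc, w in weights.items()}


def probabilistic_interleave(ranking_a, ranking_b, length, rng):
    """Build an interleaved list by repeatedly picking a ranker uniformly at random
    and sampling a not-yet-shown document from its (renormalized) distribution."""
    dists = [softmax_over_ranks(ranking_a), softmax_over_ranks(ranking_b)]
    interleaved, shown = [], set()
    all_docs = set(ranking_a) | set(ranking_b)
    while len(interleaved) < length and shown != all_docs:
        dist = dists[rng.randrange(2)]
        candidates = {doc: p for doc, p in dist.items() if doc not in shown}
        if not candidates:
            # This ranker has no unshown documents left; fall back to the other one.
            other = dists[1] if dist is dists[0] else dists[0]
            candidates = {doc: p for doc, p in other.items() if doc not in shown}
        docs, probs = zip(*candidates.items())
        doc = rng.choices(docs, weights=probs)[0]
        interleaved.append(doc)
        shown.add(doc)
    return interleaved, dists


def credit_clicks(clicked_docs, dists):
    """Illustrative credit rule: a click favors the ranker whose distribution puts
    more probability mass on the clicked document; ties contribute nothing."""
    score = 0
    for doc in clicked_docs:
        p_a = dists[0].get(doc, 0.0)
        p_b = dists[1].get(doc, 0.0)
        score += (p_a > p_b) - (p_b > p_a)
    return score  # > 0 favors ranker A, < 0 favors ranker B, 0 is a tie


def importance_weighted_outcome(historical, target_prob):
    """Sketch of the importance-sampling reuse of historical data: `historical` holds
    (outcome, logging_prob, interleaved_list) tuples logged under an earlier comparison,
    and `target_prob` gives the probability of the same interleaved list under the new
    ranker pair. Reweighting each logged outcome by target_prob / logging_prob keeps
    the average an unbiased estimate for the new comparison."""
    weighted = [
        outcome * target_prob(lst) / logging_prob
        for outcome, logging_prob, lst in historical
    ]
    return sum(weighted) / len(weighted) if weighted else 0.0


if __name__ == "__main__":
    rng = random.Random(42)
    list_a = ["d1", "d2", "d3", "d4"]
    list_b = ["d3", "d1", "d5", "d2"]
    shown, dists = probabilistic_interleave(list_a, list_b, length=4, rng=rng)
    # Pretend the user clicked the top document of the interleaved list.
    print(shown, credit_clicks([shown[0]], dists))
```

In terms of this sketch, fidelity asks that the expected value of credit_clicks favor the ranker whose documents are truly more relevant, and soundness asks that estimates such as the importance-weighted average be unbiased and consistent for that expected outcome.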
