Expected reciprocal rank for graded relevance

While numerous metrics for information retrieval are available in the case of binary relevance, there is only one commonly used metric for graded relevance, namely the Discounted Cumulative Gain (DCG). A drawback of DCG is its additive nature and the underlying independence assumption: a document at a given position always receives the same gain and discount regardless of the documents shown above it. Inspired by the "cascade" user model, we present a new editorial metric for graded relevance which overcomes this difficulty and implicitly discounts documents that are shown below very relevant documents. More precisely, this new metric is defined as the expected reciprocal length of time that the user will take to find a relevant document. It can be seen as an extension of the classical reciprocal rank to the graded relevance case, and we call it Expected Reciprocal Rank (ERR). We conduct an extensive evaluation on the query logs of a commercial search engine and show that ERR correlates better with click metrics than other editorial metrics.
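
As a minimal sketch of the cascade-style computation the abstract describes: ERR sums, over ranks, the reciprocal rank weighted by the probability that the user is satisfied at that rank and was not satisfied earlier. The grade-to-probability mapping R(g) = (2^g - 1) / 2^g_max and the default maximum grade below are assumptions taken from the standard ERR formulation, not quoted from this abstract.

    def err(grades, max_grade=4):
        """Expected Reciprocal Rank for a ranked list of graded relevance labels.

        grades: relevance grades of the results, in ranked order (top first).
        max_grade: highest possible grade on the judging scale (assumed 0..4).
        """
        p_not_stopped = 1.0  # probability the user has not yet been satisfied
        score = 0.0
        for rank, g in enumerate(grades, start=1):
            r = (2 ** g - 1) / 2 ** max_grade  # grade -> satisfaction probability
            score += p_not_stopped * r / rank   # contribution of stopping at this rank
            p_not_stopped *= 1 - r              # user continues down the list
        return score

    # Example: a highly relevant document at rank 1 dominates the score,
    # so the moderately relevant documents below it contribute little.
    print(err([4, 0, 2, 1]))

This illustrates the implicit discounting mentioned above: once a very relevant document is ranked highly, the continuation probability drops and lower-ranked documents add little to the score, unlike the position-only discount in DCG.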
