Simple Evaluation Metrics for Diversified Search Results

Traditional information retrieval research has mostly focused on satisfying clearly specified information needs. In reality, however, queries are often ambiguous and/or underspecified. In light of this, evaluating search result diversity is beginning to receive attention. We propose simple evaluation metrics for diversified Web search results. Our presumptions are that one or more interpretations (or intents) are possible for each given query, and that graded relevance assessments are available for intent-document pairs (as opposed to query-document pairs). Our goals are (a) to retrieve documents that cover as many intents as possible; and (b) to rank documents that are highly relevant to more popular intents above those that are marginally relevant to less popular intents. Unlike the Intent-Aware (IA) metrics proposed by Agrawal et al., our metrics avoid ignoring minor intents. Unlike α-nDCG proposed by Clarke et al., our metrics can accommodate (i) the fact that some intents are more likely than others for a given query; and (ii) graded relevance within each intent. Furthermore, unlike these existing metrics, our metrics do not require approximation, and they range between 0 and 1. Experiments with the binary-relevance Diversity Task data from the TREC 2009 Web Track suggest that our metrics correlate well with existing metrics but can be more intuitive. Hence, we argue that our metrics are suitable for diversity evaluation given either intent likelihood information or per-intent graded relevance, or preferably both.
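
To make the setting concrete, the sketch below illustrates one natural instantiation of goals (a) and (b): each ranked document receives a global gain, the intent-probability-weighted sum of its per-intent graded gains, which is then rank-discounted and normalised against an ideal list, nDCG-style. This is a minimal illustration under assumed details, not the paper's exact definition; the names global_gain and d_ndcg, the log2 discount, and the toy data are all hypothetical.

    import math

    def global_gain(doc, intent_probs, gains):
        # Intent-probability-weighted sum of per-intent gains: popular
        # intents count more, but every intent with nonzero probability
        # contributes, so minor intents are never ignored.
        return sum(p * gains.get((intent, doc), 0.0)
                   for intent, p in intent_probs.items())

    def d_ndcg(ranking, intent_probs, gains, cutoff=10):
        # Discounted cumulative global gain of a ranked list.
        def dcgg(docs):
            return sum(global_gain(d, intent_probs, gains) / math.log2(r + 1)
                       for r, d in enumerate(docs[:cutoff], start=1))
        # Ideal list: all assessed documents sorted by global gain, so the
        # final score is normalised to lie between 0 and 1.
        pool = {doc for (_, doc) in gains}
        ideal = sorted(pool, key=lambda d: global_gain(d, intent_probs, gains),
                       reverse=True)
        return dcgg(ranking) / dcgg(ideal)

    # Toy example: an ambiguous query with a popular and a minor intent.
    intent_probs = {"intent1": 0.7, "intent2": 0.3}
    gains = {("intent1", "d1"): 3, ("intent2", "d2"): 3, ("intent1", "d3"): 1}
    print(d_ndcg(["d2", "d1", "d3"], intent_probs, gains))  # < 1.0: d1 should lead

Under such a formulation, a document highly relevant to a popular intent outranks one marginally relevant to a minor intent, yet covering additional intents always raises the score.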

[1]  Mark Sanderson, et al. Ambiguous queries: test collections need more sense, 2008, SIGIR '08.

[2]  Paul Over, et al. TREC-7 Interactive Track Report, 1998, TREC.

[3]  Jaana Kekäläinen, et al. Cumulated gain-based evaluation of IR techniques, 2002, TOIS.

[4]  Tetsuya Sakai, et al. New Performance Metrics Based on Multigrade Relevance: Their Application to Question Answering, 2004, NTCIR.

[5]  Ben Carterette, et al. Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval, 2009.

[6]  Charles L. A. Clarke, et al. Novelty and diversity in information retrieval evaluation, 2008, SIGIR '08.

[7]  Olivier Chapelle, et al. Expected reciprocal rank for graded relevance, 2009, CIKM.

[8]  S. Robertson. The probability ranking principle in IR, 1997.

[9]  Tetsuya Sakai, et al. On the Properties of Evaluation Metrics for Finding One Highly Relevant Document, 2007.

[10]  Filip Radlinski, et al. Redundancy, diversity and interdependent document relevance, 2009, SIGIR Forum.

[11]  Tetsuya Sakai, et al. Evaluating Information Retrieval Metrics Based on Bootstrap Hypothesis Tests, 2007.

[12]  Charles L. A. Clarke, et al. An Effectiveness Measure for Ambiguous and Underspecified Queries, 2009, ICTIR.

[13]  Noriko Kando, et al. On information retrieval metrics designed for evaluation with incomplete relevance assessments, 2008, Information Retrieval.

[14]  Alistair Moffat, et al. Rank-biased precision for measurement of retrieval effectiveness, 2008, TOIS.

[15]  Stephen E. Robertson, et al. Modelling A User Population for Designing Information Retrieval Metrics, 2008, EVIA@NTCIR.

[16]  Stephen E. Robertson, et al. A new rank correlation coefficient for information retrieval, 2008, SIGIR '08.

[17]  Tetsuya Sakai, et al. Constructing a Test Collection with Multi-Intent Queries, 2010, EVIA@NTCIR.

[18]  Stephen E. Robertson, et al. Ambiguous requests: implications for retrieval tests, systems and theories, 2007, SIGIR Forum.

[19]  Alistair Moffat, et al. Against recall: is it persistence, cardinality, density, coverage, or totality?, 2009, SIGIR Forum.

[20]  David R. Karger, et al. Less is More: Probabilistic Models for Retrieving Fewer Relevant Documents, 2006.

[21]  Sreenivas Gollapudi, et al. Diversifying search results, 2009, WSDM '09.