On the reliability and intuitiveness of aggregated search metrics

Aggregating search results from diverse verticals such as news, images, videos, and Wikipedia into a single interface is a popular web search presentation paradigm. Although several aggregated search (AS) metrics have been proposed to evaluate AS result pages, their properties remain poorly understood. In this paper, we compare the properties of existing AS metrics under the assumptions that (1) queries may have multiple preferred verticals; (2) the likelihood of each vertical preference is available; and (3) the topical relevance assessments of results returned from each vertical are available. We compare a wide range of AS metrics on two test collections. Our main criteria of comparison are (1) discriminative power, which represents the reliability of a metric in comparing the performance of systems, and (2) intuitiveness, which represents how well a metric captures the various key aspects to be measured (i.e., various aspects of a user's perception of AS result pages). Our study shows that AS metrics capturing key AS components (e.g., vertical selection) have several advantages over other metrics. This work sheds new light on the further development and application of AS metrics.
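Discriminative power is commonly estimated by running a paired bootstrap significance test (in the style of Sakai's bootstrap-based metric evaluation) over every pair of systems and counting how often a metric detects a significant difference. The following is a minimal sketch of that procedure; the function names and the choice of 1000 bootstrap resamples are illustrative assumptions, not the paper's exact setup.

```python
import random

def bootstrap_pvalue(per_topic_diffs, b=1000, seed=0):
    """Paired bootstrap test on per-topic score differences between
    two systems (illustrative; b resamples, fixed seed for repeatability)."""
    rng = random.Random(seed)
    n = len(per_topic_diffs)
    observed = sum(per_topic_diffs) / n
    # Shift the differences so they satisfy the null hypothesis
    # of zero mean difference, then resample under that null.
    shifted = [d - observed for d in per_topic_diffs]
    extreme = 0
    for _ in range(b):
        sample = [shifted[rng.randrange(n)] for _ in range(n)]
        if abs(sum(sample) / n) >= abs(observed):
            extreme += 1
    return extreme / b

def discriminative_power(runs, alpha=0.05):
    """Fraction of system pairs judged significantly different.
    `runs` maps a system name to its list of per-topic metric scores."""
    names = sorted(runs)
    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
    significant = sum(
        1 for a, b in pairs
        if bootstrap_pvalue([x - y for x, y in zip(runs[a], runs[b])]) < alpha
    )
    return significant / len(pairs)
```

A metric with higher discriminative power separates more system pairs at the same significance level, which is why it serves as a reliability criterion when comparing AS metrics.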
