Evaluating aggregated search pages

Aggregating search results from a variety of heterogeneous sources or verticals, such as news, images and video, into a single interface is a popular paradigm in web search. Although various approaches exist for selecting relevant verticals or optimising the aggregated search result page, evaluating the quality of an aggregated page remains an open question. This paper proposes a general framework for evaluating the quality of aggregated search pages. We evaluate our approach by collecting annotated user preferences over a set of aggregated search pages for 56 topics and 12 verticals. We empirically demonstrate the fidelity of metrics instantiated from our proposed framework by showing that they strongly agree with the annotated user preferences over pairs of simulated aggregated pages. Furthermore, we show that our metrics agree with the majority preference more often than existing diversity-based information retrieval metrics. Finally, we demonstrate the flexibility of our framework by showing that personalised historical preference data can be used to improve the performance of our proposed metrics.
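The pairwise-agreement methodology described above can be made concrete with a small sketch. The Python snippet below is a minimal illustration, not the paper's actual evaluation code: the page representation, the `toy_metric`, and the judged pairs are hypothetical placeholders. It shows how one would measure how often a page-quality metric orders a pair of aggregated pages the same way as the annotators' majority preference.

```python
from typing import Callable, List, Tuple

# Hypothetical representation of an aggregated page: an ordered list of
# (vertical, relevance_grade) result blocks. Any metric instantiated from
# the framework would map such a page to a single utility score.
Page = List[Tuple[str, int]]
Metric = Callable[[Page], float]


def pairwise_agreement(metric: Metric,
                       judged_pairs: List[Tuple[Page, Page, str]]) -> float:
    """Fraction of judged page pairs where the metric's ordering matches
    the annotators' majority preference.

    Each judged pair is (page_a, page_b, preferred), where preferred is
    "a" or "b" according to the majority of assessors.
    """
    agreements = 0
    for page_a, page_b, preferred in judged_pairs:
        metric_prefers = "a" if metric(page_a) > metric(page_b) else "b"
        agreements += int(metric_prefers == preferred)
    return agreements / len(judged_pairs)


# A toy metric for illustration only: discount each block's relevance
# grade by its rank position on the page.
def toy_metric(page: Page) -> float:
    return sum(grade / (rank + 1) for rank, (_, grade) in enumerate(page))


if __name__ == "__main__":
    pairs = [
        ([("news", 3), ("image", 1)], [("video", 1), ("image", 0)], "a"),
        ([("web", 1), ("news", 0)], [("image", 2), ("news", 2)], "b"),
    ]
    print(f"agreement = {pairwise_agreement(toy_metric, pairs):.2f}")
```

Under this setup, comparing two candidate metrics reduces to comparing their agreement scores over the same set of annotated pairs, which mirrors the comparison against diversity-based metrics described in the abstract.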
