A Methodology for Evaluating Aggregated Search Results

Aggregated search is the task of incorporating results from different specialized search services, or verticals, into Web search results. While most prior work focuses on deciding which verticals to present, the task of deciding where in the Web results to embed the vertical results has received less attention. We propose a methodology for evaluating an aggregated set of results. Our method elicits a relatively small number of human judgements for a given query and then uses these to facilitate a metric-based evaluation of any possible presentation for the query. An extensive user study with 13 verticals confirms that, when users prefer one presentation of results over another, our metric agrees with the stated preference. By using Amazon's Mechanical Turk, we show that reliable assessments can be obtained quickly and inexpensively.
