Better Effectiveness Metrics for SERPs, Cards, and Rankings

Offline metrics for IR evaluation are often derived from a user model that seeks to capture the interaction between the user and the ranking, conflating the user's interaction with a ranking of documents and their interaction with the search results page. A desirable property of any effectiveness metric is that the scores it generates over a set of rankings correlate well with the "satisfaction" or "goodness" scores attributed to those same rankings by a population of searchers. Using data from a large-scale web search engine, we find that offline effectiveness metrics do not correlate well with a behavioural measure of satisfaction inferred from user activity logs. We then examine three mechanisms for improving the correlation: tuning the model parameters; improving label coverage, so that more kinds of item are labelled and hence included in the evaluation; and modifying the underlying user models that define the metrics. In combination, these three mechanisms transform a wide range of common metrics into "card-aware" variants which allow for the gain from cards (or snippets), varying probabilities of clickthrough, and good abandonment.
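To make the "card-aware" idea concrete, the sketch below shows one plausible way such a metric could be structured, using rank-biased precision (RBP) as the base metric. This is an illustrative assumption, not the paper's exact formulation: each result contributes gain both directly from its card (covering good abandonment) and from the landing page, the latter weighted by an item-specific clickthrough probability. The field names `card_gain`, `page_gain`, and `p_click` are hypothetical.

```python
# Hedged sketch: a card-aware RBP-style metric. Assumes each ranked item
# carries a card-level gain, a page-level gain, and a clickthrough
# probability; these names are illustrative, not from the paper.

def card_aware_rbp(results, persistence=0.8):
    """Expected gain under a simple card-aware RBP-style user model.

    `results` is a ranked list of dicts with keys:
      card_gain - gain the user obtains from the card/snippet alone
      page_gain - gain obtained only if the user clicks through
      p_click   - probability the user clicks this result
    """
    score = 0.0
    weight = 1.0 - persistence  # RBP normalisation: rank weights sum to 1
    for i, r in enumerate(results):
        # Expected gain at this rank: card gain is always observed;
        # page gain is discounted by the chance of a click.
        expected_gain = r["card_gain"] + r["p_click"] * r["page_gain"]
        score += weight * (persistence ** i) * expected_gain
    return score
```

Under this model a result whose card fully answers the query still contributes gain even with `p_click = 0`, which is how good abandonment is rewarded rather than penalised.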
