Meta-evaluation of Online and Offline Web Search Evaluation Metrics

As in most information retrieval (IR) studies, evaluation plays an essential role in Web search research. Both offline and online evaluation metrics are adopted to measure the performance of search engines. Offline metrics are usually based on relevance judgments of query-document pairs collected from assessors, while online metrics exploit user behavior data, such as clicks, logged by search engines to compare search algorithms. Although both types of IR evaluation metrics have achieved success, the extent to which they can predict user satisfaction remains under-investigated. To shed light on this question, we meta-evaluate a series of existing online and offline metrics to study how well they infer actual search user satisfaction in different search scenarios. We find that both types of metrics correlate significantly with user satisfaction, but they reflect satisfaction from different perspectives for different search tasks: offline metrics align better with user satisfaction in homogeneous search (i.e., ten blue links), whereas online metrics perform better when vertical results are federated. Finally, we propose incorporating mouse hover information into existing online evaluation metrics and empirically show that the resulting metrics align better with search user satisfaction than purely click-based online metrics.
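To make the meta-evaluation procedure concrete, the sketch below illustrates how per-query metric scores can be correlated with satisfaction labels. It is a minimal illustration, not the paper's exact setup: the toy per-query data, the choice of DCG@10 as the offline metric, the reciprocal rank of the first click as the online metric, and the helper names (dcg_at_k, first_click_reciprocal_rank) are all assumptions made for demonstration.

```python
# Minimal sketch of meta-evaluating offline vs. online metrics against
# user satisfaction labels. Data, metric choices, and names are illustrative.
import numpy as np
from scipy.stats import pearsonr, spearmanr

def dcg_at_k(relevances, k=10):
    """Offline metric: DCG over graded relevance judgments of the top-k results."""
    rels = np.asarray(relevances, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rels.size + 2))
    return float(np.sum((2 ** rels - 1) / discounts))

def first_click_reciprocal_rank(click_ranks):
    """Simple click-based online metric: reciprocal rank of the first click (0 if no click)."""
    return 0.0 if not click_ranks else 1.0 / min(click_ranks)

# Hypothetical per-query observations: graded relevance judgments for the
# ranked results, 1-based click positions, and a 5-point satisfaction label.
queries = [
    {"rels": [3, 2, 0, 1, 0], "clicks": [1, 2], "sat": 5},
    {"rels": [0, 1, 0, 0, 2], "clicks": [5],    "sat": 3},
    {"rels": [0, 0, 0, 0, 0], "clicks": [],     "sat": 1},
    {"rels": [2, 3, 1, 0, 0], "clicks": [2],    "sat": 4},
]

offline = [dcg_at_k(q["rels"]) for q in queries]
online = [first_click_reciprocal_rank(q["clicks"]) for q in queries]
sat = [q["sat"] for q in queries]

# Meta-evaluation step: correlate each metric's scores with the satisfaction labels.
for name, scores in [("offline DCG@10", offline), ("online click RR", online)]:
    r, _ = pearsonr(scores, sat)
    rho, _ = spearmanr(scores, sat)
    print(f"{name}: Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```

The same procedure extends naturally to hover-augmented online metrics: one would replace the click-based score with a score that also credits hovered results, then compare the resulting correlations with those of the purely click-based variant.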
