Incorporating Query Reformulating Behavior into Web Search Evaluation

While batch evaluation plays a central role in Information Retrieval (IR) research, most evaluation metrics are based on user models that focus mainly on browsing and clicking behaviors. As users' perceived satisfaction may also be affected by their search intent, constructing different user models for different search intents may help in designing better evaluation metrics. However, user intent is usually unobservable in practice. Because query reformulation behaviors can reflect search intent to a certain extent and correlate strongly with users' perceived satisfaction for a specific query, these observable factors may benefit the design of evaluation metrics. Yet how to incorporate the search intent behind query reformulation into user behavior and satisfaction models remains under-investigated. To investigate the relationships among query reformulation, search intent, and user satisfaction, we explore a publicly available web search dataset and find that query reformulations can serve as a good proxy for inferring user intent; reformulation actions may therefore help in designing better web search effectiveness metrics. We then propose a group of Reformulation-Aware Metrics (RAMs) that improve existing click model-based metrics. Experimental results on two public session datasets show that RAMs correlate significantly better with user satisfaction than existing evaluation metrics. In a robustness test, we find that RAMs achieve good performance even when only a small proportion of satisfaction training labels is available. We further show that, once trained, RAMs can be directly applied to a new dataset for offline evaluation. This work demonstrates the possibility of designing better evaluation metrics by incorporating fine-grained search context factors.
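The abstract does not give RAMs' exact formulation, but the general idea of conditioning a click-model-based metric on reformulation behavior can be sketched. Below is a minimal illustrative example, not the paper's method: a rank-biased-precision-style score whose persistence parameter depends on the type of reformulation observed after the query. The reformulation categories and persistence values are hypothetical placeholders.

```python
# Illustrative sketch only (NOT the paper's actual RAM formulation):
# an RBP-style metric where the user's continuation probability p is
# chosen by the reformulation behavior that follows the query.

# Hypothetical mapping from reformulation type to persistence p.
PERSISTENCE = {
    "none": 0.8,        # session ended without reformulation
    "generalize": 0.7,  # user broadened the query
    "specialize": 0.6,  # user narrowed the query
    "new_topic": 0.5,   # user abandoned the topic entirely
}

def reformulation_aware_rbp(relevances, reformulation_type):
    """RBP-style score: (1 - p) * sum_k rel_k * p^k, with p selected
    by the reformulation type observed after this query."""
    p = PERSISTENCE[reformulation_type]
    return (1 - p) * sum(rel * p ** rank for rank, rel in enumerate(relevances))

# The same ranked list receives different scores depending on the
# follow-up behavior, reflecting different inferred intents.
rels = [1, 0, 1, 0, 0]
print(reformulation_aware_rbp(rels, "none"))
print(reformulation_aware_rbp(rels, "new_topic"))
```

Fitting the persistence parameters to satisfaction labels, rather than fixing them by hand as above, would correspond to the training step the abstract describes.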
