Cascade or Recency: Constructing Better Evaluation Metrics for Session Search

Session search evaluation has recently received more attention, because realistic search scenarios usually involve multiple queries and multiple interactions between users and systems. Having evolved from model-based evaluation metrics for a single query, existing session-based metrics follow a generic framework based on the cascade hypothesis. The cascade hypothesis assumes that lower-ranked search results and later-issued queries receive less attention from users and should therefore be assigned smaller weights when calculating evaluation metrics. By explaining why users' attention decays on search engine result pages, this hypothesis has been highly successful in modeling search users' behavior and designing evaluation metrics. However, recent studies have found that the recency effect also plays an important role in determining user satisfaction in search sessions. In particular, whether a user feels satisfied with the later-issued queries heavily influences their satisfaction with the whole session. To incorporate both the cascade hypothesis and the recency effect into the design of session search evaluation metrics, we propose Recency-aware Session-based Metrics (RSMs), which simultaneously characterize users' examination process with a browsing model and their cognitive process with a utility accumulation model. Using both self-constructed and publicly available user search behavior datasets, we show the effectiveness of the proposed RSMs by comparing them with existing session-based metrics in terms of correlation with user satisfaction. We also find that the influence of the cascade and recency effects varies dramatically among tasks of different difficulty and complexity, which suggests that different model parameters should be used for different types of search tasks. Our findings highlight the importance of investigating and utilizing cognitive effects, in addition to examination hypotheses, in search evaluation.
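
To make the contrast between the two effects concrete, the following is a minimal sketch and not the RSM definition, which the abstract does not give. It assumes an RBP-style geometric discount over ranks within each query, plus two hypothetical per-query decay parameters: `cascade` down-weights later queries and `recency` down-weights earlier ones. All function and parameter names are illustrative assumptions.

```python
from typing import List, Sequence


def rank_discounted_gain(gains: Sequence[float], p: float = 0.8) -> float:
    """Utility of one ranked list under a geometric rank discount
    (RBP-style): (1 - p) * sum_r p**r * gain_r, with r starting at 0."""
    return (1.0 - p) * sum(g * p ** r for r, g in enumerate(gains))


def query_weights(n: int, cascade: float = 1.0, recency: float = 1.0) -> List[float]:
    """Normalized weights for the n queries of a session.

    cascade < 1 shrinks the weight of later queries (cascade hypothesis);
    recency < 1 shrinks the weight of earlier queries (recency effect).
    Both parameters are assumptions made for illustration only.
    """
    if n == 0:
        return []
    raw = [cascade ** i * recency ** (n - 1 - i) for i in range(n)]
    total = sum(raw)
    return [w / total for w in raw]


def session_score(session: Sequence[Sequence[float]], p: float = 0.8,
                  cascade: float = 1.0, recency: float = 1.0) -> float:
    """Weighted sum of per-query utilities over the whole session."""
    weights = query_weights(len(session), cascade, recency)
    return sum(w * rank_discounted_gain(g, p) for w, g in zip(weights, session))


if __name__ == "__main__":
    # A session whose first query returns nothing useful and whose
    # last query returns three relevant results.
    session = [[0, 0, 0], [1, 1, 1]]
    print("cascade-weighted:", session_score(session, cascade=0.5))
    print("recency-weighted:", session_score(session, recency=0.5))
```

In this toy session the recency-weighted score is higher than the cascade-weighted one, because the useful results arrive in the final query. Treating the two decays as separate parameters also mirrors the observation that their relative influence may differ across task types.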
