Evaluating Web Search with a Bejeweled Player Model

The design of a Web search evaluation metric is closely related to how the user's interaction process is modeled, and each behavioral model results in a different metric for evaluating search performance. In these models and the user behavior assumptions behind them, when a user ends a search session is a prime concern because it is closely tied to both benefit and cost estimation. Existing metric designs usually adopt simplified criteria to decide the stopping point: (1) an upper limit on benefit (e.g. RR, AP); or (2) an upper limit on cost (e.g. Precision@N, DCG@N). However, in many practical search sessions (e.g. exploratory search), the stopping criterion is more complex than these simplified cases. Analyzing the benefit and cost of actual users' search sessions, we find that stopping criteria vary with search tasks and usually reflect the combined effect of both benefit and cost factors. Inspired by the popular computer game Bejeweled, we propose a Bejeweled Player Model (BPM) to simulate users' search interaction processes and evaluate their search performance. In the BPM, a user stops when he/she either has found sufficient useful information or has no more patience to continue. Given this assumption, we propose a new evaluation framework based on upper limits (either fixed or changing as the search proceeds) for both benefit and cost. We show how to derive a new metric from the framework and demonstrate that it can be adopted to revise traditional metrics such as Discounted Cumulative Gain (DCG), Expected Reciprocal Rank (ERR) and Average Precision (AP). To show the effectiveness of the proposed framework, we compare it with a number of existing metrics in terms of their correlation with user satisfaction, based on a dataset that collects users' explicit satisfaction feedback and assessors' relevance judgments. Experimental results show that the proposed framework correlates better with user satisfaction feedback.
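To make the stopping assumption concrete, the following is a minimal sketch, in Python, of how a DCG-style metric could be truncated by BPM-style upper limits on both benefit and cost. The gain function, the unit cost per examined result, and the thresholds t_benefit and t_cost are illustrative assumptions for this sketch only, not the paper's actual parameterization.

```python
# A minimal sketch of a BPM-style stopping rule applied to a DCG-like metric.
# The gain function, the unit cost per result, and the thresholds t_benefit /
# t_cost are illustrative assumptions, not the paper's exact formulation.

import math

def bpm_style_score(relevances, t_benefit=3.0, t_cost=10.0):
    """Accumulate discounted gain down the ranked list, stopping as soon as
    either the accumulated benefit reaches t_benefit or the accumulated
    cost (here: one unit per examined result) reaches t_cost."""
    benefit, cost, score = 0.0, 0.0, 0.0
    for rank, rel in enumerate(relevances, start=1):
        score += (2 ** rel - 1) / math.log2(rank + 1)  # DCG-style discounted gain
        benefit += rel                                  # simple additive benefit
        cost += 1.0                                     # unit cost per result examined
        if benefit >= t_benefit or cost >= t_cost:      # BPM stopping condition
            break
    return score

# Example: graded relevance judgments for a ranked list
print(bpm_style_score([2, 0, 1, 3, 0, 2]))
```

In the full framework the two limits may also change as the search proceeds rather than stay fixed, as they do in this sketch.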
