Modeling user variance in time-biased gain

Cranfield-style information retrieval evaluation accounts for variance in user information needs by evaluating retrieval systems over a set of search topics. For each search topic, however, traditional metrics model all users as searching ranked lists in exactly the same manner and thus have zero variance in their per-topic estimates of effectiveness. Metrics that fail to model user variance overestimate the effect size of differences between retrieval systems. Modeling user variance is critical to understanding the impact of effectiveness differences on the actual user experience: if the variance of a difference is high relative to its mean, the standardized effect size is small, and many real users will not notice the improvement. Time-biased gain is an evaluation metric that models user interaction with ranked lists displayed using document surrogates (e.g., titles and snippets). In this paper, we extend the stochastic simulation of time-biased gain to model the variation between users. We validate this new version of time-biased gain by showing that it produces distributions of gain that agree well with the actual distributions produced by real users. With a per-topic variance in its effectiveness measure, time-biased gain allows the measurement of the effect size of differences, which lets researchers understand the extent to which predicted performance improvements matter to real users.
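
To make the mechanism concrete, the sketch below shows a Monte Carlo simulation of time-biased gain with per-user behavioral variation. It is a minimal illustration under stated assumptions, not the authors' calibrated simulator: the decay half-life follows the 224-second value reported in the time-biased gain papers, but the reading-time distributions and the click/save probabilities (summary_time, doc_time, p_click_rel, p_click_non, p_save_rel) are hypothetical values chosen only to show how drawing parameters per user turns a zero-variance metric into a per-topic distribution of gain.

```python
import math
import random

HALF_LIFE = 224.0  # decay half-life in seconds, per the time-biased gain papers

def simulate_user(relevance, rng):
    """Simulate one user scanning a ranked list; return that user's gain."""
    # Draw this user's behavioral parameters so that different simulated
    # users read at different speeds and click/save with different
    # probabilities. All distributions and constants here are assumptions
    # for illustration, not the paper's calibrated values.
    summary_time = rng.lognormvariate(math.log(4.0), 0.4)   # secs per snippet
    doc_time = rng.lognormvariate(math.log(30.0), 0.5)      # secs per document
    p_click_rel = min(1.0, max(0.0, rng.gauss(0.65, 0.1)))  # P(click | relevant snippet)
    p_click_non = min(1.0, max(0.0, rng.gauss(0.35, 0.1)))  # P(click | non-relevant snippet)
    p_save_rel = min(1.0, max(0.0, rng.gauss(0.80, 0.1)))   # P(save | relevant document)

    t, gain = 0.0, 0.0
    for rel in relevance:                 # walk down the ranked list
        t += summary_time                 # time spent reading the snippet
        p_click = p_click_rel if rel else p_click_non
        if rng.random() < p_click:
            t += doc_time                 # time spent reading the document
            if rel and rng.random() < p_save_rel:
                # Gain is discounted by the probability the user is still
                # searching at time t: D(t) = exp(-t * ln 2 / half-life).
                gain += math.exp(-t * math.log(2) / HALF_LIFE)
    return gain

def tbg_distribution(relevance, n_users=1000, seed=0):
    """Monte Carlo over simulated users -> per-topic mean and variance of gain."""
    rng = random.Random(seed)
    gains = [simulate_user(relevance, rng) for _ in range(n_users)]
    mean = sum(gains) / n_users
    var = sum((g - mean) ** 2 for g in gains) / (n_users - 1)
    return mean, var

# Example: a ranked list with relevant documents (1) at ranks 1, 3, and 6.
print(tbg_distribution([1, 0, 1, 0, 0, 1, 0, 0, 0, 0]))
```

Given such per-topic means and variances for two systems, a standardized effect size such as Cohen's d (mean difference divided by pooled standard deviation) can be computed; this is exactly the quantity that a metric with zero per-topic variance overstates.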
