The Fault, Dear Researchers, is not in Cranfield, But in our Metrics, that they are Unrealistic

As designers of information retrieval (IR) systems, we need some way to measure the performance of our systems. An excellent approach is to directly measure actual user performance, either in situ or in the laboratory [12]. The downside of live user involvement is its prohibitive cost when many evaluations are required; for example, it is common practice to sweep parameter settings of ranking algorithms in order to optimize retrieval metrics on a test collection. The Cranfield approach to IR evaluation provides low-cost, reusable measures of system performance. Cranfield-style evaluation has frequently been criticized as too divorced from the reality of how users search, but there really is nothing wrong with the approach [18]. The Cranfield approach is effectively a simulation of IR system usage that attempts to predict the performance of one system relative to another [15]. As such, we should think of the Cranfield approach as the application of models to make predictions, which is common practice in science and engineering: physics has equations of motion, civil engineering has models of concrete strength, and epidemiology has models of disease spread. In all of these fields, it is well understood that the models are simplifications of reality, yet the models still support useful predictions. Information retrieval’s predictive models are our evaluation metrics. The criticism of system-oriented IR evaluation should therefore be redirected: the problem is not with Cranfield, which is just another name for making predictions given a model, but with the metrics. We believe that rather than criticizing Cranfield, the correct response is to develop better metrics. We should make metrics that are more predictive of human performance. We should make metrics that incorporate the user interface and realistically represent the variation in user behavior. We should make metrics that encapsulate our best understanding of search behavior. In popular parlance, we should bring solutions, not problems, to the system-oriented IR researcher. To this end, we have developed a new evaluation metric, time-biased gain (TBG), which predicts IR system performance in human terms: the expected number of relevant documents a user will find [16].
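
To make this prediction concrete, the following Python sketch computes a simplified time-biased gain for a single ranked list. It assumes an exponential decay D(t) = exp(-t ln 2 / h) applied at the expected time the user reaches each rank; the default parameter values (summary scan time, a linear reading-time model, click and judging probabilities, and the half-life h) are illustrative placeholders rather than the calibrated constants of the published metric [16].

    import math

    def time_biased_gain(rels, doc_words,
                         half_life=224.0,        # decay half-life in seconds (illustrative value)
                         t_summary=4.4,          # assumed seconds to scan a result summary
                         read_a=0.02, read_b=8.0,    # assumed reading time: a * words + b seconds
                         p_click_rel=0.65, p_click_nonrel=0.40,  # assumed click probabilities
                         p_judge_rel=0.75):      # assumed P(user recognizes a relevant document)
        """Simplified time-biased gain: expected relevant documents found,
        discounted by the expected time needed to reach each rank.

        rels[k] is 1 if the document at rank k is relevant, else 0;
        doc_words[k] is that document's length in words.
        """
        gain = 0.0
        elapsed = 0.0  # expected time (seconds) to arrive at the current rank
        for rel, words in zip(rels, doc_words):
            decay = math.exp(-elapsed * math.log(2) / half_life)
            p_click = p_click_rel if rel else p_click_nonrel
            # Gain accrues only if the document is relevant, clicked, and judged relevant.
            gain += rel * p_click * p_judge_rel * decay
            # Advance the clock: every summary is scanned; clicked documents are read.
            elapsed += t_summary + p_click * (read_a * words + read_b)
        return gain

    # Example: relevant documents at ranks 1 and 3, document lengths in words.
    print(time_biased_gain([1, 0, 1, 0], [600, 1200, 900, 400]))

The returned value can be read in human terms as the expected number of relevant documents the simulated user identifies, with documents reached later in the session contributing less because of the time-based decay.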

[1] Rehan Khan, et al. The impact of result abstracts on task completion time, 2009.

[2] Milad Shokouhi, et al. Expected browsing utility for web search evaluation, 2010, CIKM.

[3] Mark D. Smucker, et al. Report on the SIGIR 2010 workshop on the simulation of interaction, 2011, SIGIR Forum.

[4] Alistair Moffat, et al. Click-based evidence for decaying weight distributions in search effectiveness metrics, 2010, Information Retrieval.

[5] Andrew Turpin, et al. Do batch and user evaluations give the same results?, 2000, SIGIR '00.

[6] Ellen M. Voorhees. I Come Not To Bury Cranfield, but to Praise It, 2009, NIST.

[7] Mark D. Dunlop. Time, relevance and interaction modelling for information retrieval, 1997, SIGIR '97.

[8] Leif Azzopardi. Usage based effectiveness measures: monitoring application performance in information retrieval, 2009, CIKM.

[9] J. Shane Culpepper, et al. Including summaries in system evaluation, 2009, SIGIR.

[10] Jaana Kekäläinen, et al. Cumulated gain-based evaluation of IR techniques, 2002, TOIS.

[11] Ben Carterette, et al. Simulating simple user behavior for system effectiveness evaluation, 2011, CIKM '11.

[12] Kalervo Järvelin, et al. Test Collection-Based IR Evaluation Needs Extension toward Sessions - A Case of Extremely Short Queries, 2009, AIRS.

[13] Charles L. A. Clarke, et al. Time-based calibration of effectiveness measures, 2012, SIGIR '12.

[14] Marti A. Hearst, et al. The state of the art in automating usability evaluation of user interfaces, 2001, CSUR.

[15] Georges Dupret. Discounted Cumulative Gain and User Decision Models, 2011, SPIRE.

[16] Olivier Chapelle, et al. Expected reciprocal rank for graded relevance, 2009, CIKM.

[17] Donna Harman, et al. Information Retrieval Evaluation, 2011, Synthesis Lectures on Information Concepts, Retrieval, and Services.

[18] Jimmy J. Lin, et al. How do users find things with PubMed?: towards automatic utility evaluation with user simulations, 2008, SIGIR '08.

[19] Diane Kelly, et al. Methods for Evaluating Interactive Information Retrieval Systems with Users, 2009, Found. Trends Inf. Retr.

[20] J. Banks, et al. Discrete-Event System Simulation, 1995.