User performance versus precision measures for simple search tasks

Several recent studies have demonstrated that the types of improvement in information retrieval system effectiveness reported in forums such as SIGIR and TREC do not translate into benefits for users. Two of these studies used an instance recall task, and a third used a question answering task, so it is perhaps unsurprising that precision-based measures of IR system effectiveness on one-shot query evaluation do not correlate with user performance on such tasks. In this study, we evaluate two different information retrieval tasks on TREC Web-track data: a precision-based user task, measured by the length of time that users need to find a single document relevant to a TREC topic, and a simple recall-based task, measured by the total number of relevant documents that users can identify within five minutes. Users employ search engines with controlled mean average precision (MAP) of between 55% and 95%. Our results show that there is no significant relationship between system effectiveness measured by MAP and the precision-based task. A significant but weak relationship is present for the precision at one document returned (P@1) metric. A weak relationship is also present between MAP and the simple recall-based task.
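For readers unfamiliar with the two system-effectiveness metrics named above, the sketch below shows how MAP and P@1 are conventionally computed from a ranked result list and a set of known relevant documents. This is an illustrative example using hypothetical document identifiers, not the evaluation code used in the study.

```python
# Illustrative sketch (assumed, not the authors' code): standard definitions
# of average precision, MAP, and P@1 for a ranked retrieval run.

def average_precision(ranking, relevant):
    """Average precision for one topic: mean of the precision values at the
    rank of each relevant document retrieved, divided by |relevant|."""
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0

def precision_at_1(ranking, relevant):
    """P@1: 1.0 if the top-ranked document is relevant, else 0.0."""
    return 1.0 if ranking and ranking[0] in relevant else 0.0

# MAP is the mean of average precision over a set of topics.
# (topic id -> (system ranking, set of relevant documents); hypothetical data)
topics = {
    "t1": (["d3", "d1", "d7"], {"d1", "d7"}),
    "t2": (["d2", "d9", "d4"], {"d2"}),
}
ap_values = [average_precision(r, rel) for r, rel in topics.values()]
map_score = sum(ap_values) / len(ap_values)
print(f"MAP = {map_score:.3f}")
print(f"P@1 (t1) = {precision_at_1(*topics['t1'])}")
```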
