Test Collection-Based IR Evaluation Needs Extension toward Sessions - A Case of Extremely Short Queries

There is overwhelming evidence that real users of IR systems often prefer extremely short queries (one or two individual words) but try out several queries if needed. Such behavior is fundamentally different from the process modeled in traditional test collection-based IR evaluation, which relies on more verbose queries and only one query per topic. In the present paper, we propose an extension to test collection-based evaluation that utilizes sequences of short queries based on empirically grounded but idealized session strategies. We employ TREC data and have test persons suggest search words, while simulating the sessions according to the idealized strategies to ensure repeatability and control. The experimental results show that, surprisingly, web-like very short queries (including sequences of one-word queries) typically lead to good enough results even in a TREC-type test collection. This finding explains the observed real user behavior: since a few very simple attempts normally yield good enough results, there is no need to invest more effort. We conclude by discussing the consequences of our findings for IR evaluation.
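To make the evaluation setup concrete, the following is a minimal sketch of how one idealized session strategy could be simulated against a test collection. The incremental one-word strategy, the nDCG-based scoring, the `run_query` retrieval stub, and the "good enough" stopping threshold are illustrative assumptions for this sketch, not the exact protocol of the paper.

```python
"""Sketch: simulate an idealized session of very short queries over a
test collection and score each query with normalized DCG@k.
All parameter names and the stopping rule are assumptions."""

import math
from typing import Callable, Dict, List, Sequence


def dcg_at_k(gains: Sequence[float], k: int) -> float:
    """Discounted cumulated gain over the top-k ranks (log2 discount)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))


def simulate_session(
    keywords: List[str],
    run_query: Callable[[str], List[str]],   # query text -> ranked list of doc ids
    graded_relevance: Dict[str, float],      # doc id -> graded gain for the topic
    k: int = 10,
    good_enough: float = 0.5,                # assumed satisfaction threshold
) -> Dict[str, float]:
    """Issue a sequence of short queries built from user-suggested keywords
    (one word, then two words, ...) and stop once the result list is
    'good enough' by nDCG@k."""
    ideal = sorted(graded_relevance.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k) or 1.0

    results: Dict[str, float] = {}
    for n_words in range(1, len(keywords) + 1):
        query = " ".join(keywords[:n_words])
        ranked = run_query(query)
        gains = [graded_relevance.get(doc_id, 0.0) for doc_id in ranked]
        ndcg = dcg_at_k(gains, k) / ideal_dcg
        results[query] = ndcg
        if ndcg >= good_enough:              # simulated user is satisfied; session ends
            break
    return results
```

With `run_query` backed by any standard ranking model over a TREC collection, such a simulator keeps the session strategy fixed and repeatable, which is the methodological point of replacing a single verbose query per topic with a controlled sequence of short ones.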
