BEST PRACTICES FOR TEST COLLECTION CREATION AND INFORMATION RETRIEVAL SYSTEM EVALUATION

There is a widely held perception that the evaluation of search systems is a difficult, time-consuming, and expensive process to get right. Much of that perception stems from the impression given by a number of annual academic evaluation campaigns, which appear to require large amounts of money and a substantial community effort to conduct evaluation to an acceptable level of accuracy. However, a large body of research shows that evaluation can be conducted far more quickly than is generally thought. The broad conclusions of this research have not yet been collated into a single publication. In addition, almost all publications on the evaluation of information retrieval systems are geared towards academic research. The needs of this community are not the same as the needs of the users, administrators, and designers of operational search systems. While the academic community is willing to work with shared testing resources that are perhaps overly abstract representations of a search situation, practitioners cannot use these resources; they need a way to test on their actual data sets in order to understand exactly which queries are working and which are failing. If they are deciding which of two commercial systems to purchase, it is critical that they can test on their own data sets. Surprisingly little has been written for practitioners on how to assess the quality of an operational search system, and even less about how to conduct such an evaluation quickly and with minimal resources. This document addresses that gap by providing a guide to conducting an evaluation. It describes the choice of collection, how to source topics, how to conduct relevance assessments, and which of the many available evaluation measures to use. The issues surrounding the testing of multilingual search are also addressed. The document also describes two case studies: the first illustrates how one of the large research-oriented evaluation campaigns constructs test collections; the second shows how an organisation compared two search systems on its own data.
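To make the idea of an "evaluation measure" concrete, the following is a minimal sketch (not taken from the document itself) of how two common measures, precision at k and average precision, could be computed once relevance judgments exist. The data structures and names (qrels, run) are illustrative assumptions: qrels maps each query to the set of documents judged relevant, and run maps each query to the ranked list of documents a system returned.

    # Illustrative sketch only; names and data are hypothetical.
    def precision_at_k(ranked_docs, relevant, k=10):
        """Fraction of the top-k retrieved documents that are relevant."""
        top_k = ranked_docs[:k]
        return sum(1 for d in top_k if d in relevant) / k

    def average_precision(ranked_docs, relevant):
        """Mean of the precision values at each rank where a relevant document appears."""
        if not relevant:
            return 0.0
        hits, precisions = 0, []
        for rank, doc in enumerate(ranked_docs, start=1):
            if doc in relevant:
                hits += 1
                precisions.append(hits / rank)
        return sum(precisions) / len(relevant)

    # One query's judgments and one system's ranked results.
    qrels = {"q1": {"d2", "d5", "d9"}}
    run = {"q1": ["d2", "d7", "d5", "d1", "d9", "d4"]}

    print(precision_at_k(run["q1"], qrels["q1"], k=5))   # 0.6
    print(average_precision(run["q1"], qrels["q1"]))     # (1/1 + 2/3 + 3/5) / 3 = 0.756

Averaging a measure such as average precision over a set of queries gives a single score per system, which is the form in which most of the comparisons discussed in this document are made; graded measures such as nDCG follow the same pattern but weight documents by relevance level and rank.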
