BEST PRACTICES FOR TEST COLLECTION CREATION AND INFORMATION RETRIEVAL SYSTEM EVALUATION

There is a widely held perception that the evaluation of search systems is a difficult, time-consuming, and expensive process to get right. Much of that perception stems from the impression given by a number of annual academic evaluation campaigns, which appear to require large amounts of money and a substantial community effort to conduct evaluation to an acceptable level of accuracy. However, a large body of research shows that evaluation can be conducted far more quickly than is generally thought. The broad conclusions of this research have not yet been collated into a single publication. In addition, almost all publications on the evaluation of information retrieval systems are geared towards academic research. The needs of this community are not the same as the needs of the users, administrators, and designers of operational search systems. While the academic community is willing to work with shared testing resources that are perhaps overly abstract representations of a search situation, practitioners cannot use these resources; they need a way to test on their actual data sets in order to understand exactly which queries are working and which are failing. If they are deciding which of two commercial systems to purchase, it is critical that they can test on their own data sets. Surprisingly little has been written for practitioners on how to assess the quality of an operational search system, and even less about how to conduct such an evaluation quickly and with minimal resources. This document addresses that gap by providing a guide to conducting an evaluation. It describes the choice of collection, how to source topics, how to conduct relevance assessments, and which of the many available evaluation measures to use. The issues surrounding the testing of multilingual search are also addressed. The document also describes two case studies: the first illustrates how one of the large research-oriented evaluation campaigns constructs test collections; the second shows how an organisation compared two search systems on its own data.
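To make the idea of an "evaluation measure" concrete, the following is a minimal sketch (not taken from the document itself) of how two common measures, precision at k and average precision, could be computed once relevance judgments exist. The data structures and names (qrels, run) are illustrative assumptions: qrels maps each query to the set of documents judged relevant, and run maps each query to the ranked list of documents a system returned.

    # Illustrative sketch only; names and data are hypothetical.
    def precision_at_k(ranked_docs, relevant, k=10):
        """Fraction of the top-k retrieved documents that are relevant."""
        top_k = ranked_docs[:k]
        return sum(1 for d in top_k if d in relevant) / k

    def average_precision(ranked_docs, relevant):
        """Mean of the precision values at each rank where a relevant document appears."""
        if not relevant:
            return 0.0
        hits, precisions = 0, []
        for rank, doc in enumerate(ranked_docs, start=1):
            if doc in relevant:
                hits += 1
                precisions.append(hits / rank)
        return sum(precisions) / len(relevant)

    # One query's judgments and one system's ranked results.
    qrels = {"q1": {"d2", "d5", "d9"}}
    run = {"q1": ["d2", "d7", "d5", "d1", "d9", "d4"]}

    print(precision_at_k(run["q1"], qrels["q1"], k=5))   # 0.6
    print(average_precision(run["q1"], qrels["q1"]))     # (1/1 + 2/3 + 3/5) / 3 = 0.756

Averaging a measure such as average precision over a set of queries gives a single score per system, which is the form in which most of the comparisons discussed in this document are made; graded measures such as nDCG follow the same pattern but weight documents by relevance level and rank.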
