Test Collection Based Evaluation of Information Retrieval Systems

Use of test collections and evaluation measures to assess the effectiveness of information retrieval systems has its origins in work dating back to the early 1950s. In the nearly 60 years since that work began, the use of test collections has become the de facto standard of evaluation. This monograph surveys this research, explaining the methods and measures devised for evaluating retrieval systems, including a detailed look at the use of statistical significance testing in retrieval experimentation. It also reviews more recent examinations of the validity of the test collection approach and of evaluation measures, and outlines current research trends that exploit query logs and live labs. At its core, the modern-day test collection differs little from the structures that the pioneering researchers of the 1950s and 1960s conceived. This tutorial and review shows that, despite its age, this long-standing evaluation method remains a highly valued tool for retrieval research.
