Repeatable evaluation of information retrieval effectiveness in dynamic environments

In dynamic environments such as the World Wide Web, the changing document collection, query population, and set of search services demand frequent repetition of search effectiveness (relevance) evaluations. Reconstructing static test collections, such as those used in TREC, requires considerable human effort, since large collection sizes demand judgments deep into retrieved pools. In practice, it is common to perform shallow evaluations over a small number of conditions (often a binary comparison, A vs. B) without system pooling and without intending to construct a reusable test collection. The query sample sizes required in such evaluations can be reliably estimated with the simple bootstrap estimate of the reproducibility probability (observed power) of hypothesis tests; however, these sample sizes are typically much larger than those needed for static collections. We propose a semiautomatic evaluation framework that reduces this effort by enabling intelligent evaluation strategies. We validate the framework against a manual evaluation of the top ten results of ten web search engines across 896 queries in navigational and informational tasks. Augmenting manual judgments with pseudo-relevance judgments mined, even naively, from web taxonomies roughly halves both the chance of missing a correct binary conclusion and the chance of reaching an erroneous one.
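To make the sample-size estimation concrete, the sketch below illustrates the simple bootstrap estimate of the reproducibility probability (observed power): per-query effectiveness differences between two search engines are resampled with replacement, a paired significance test is run on each resample, and the rejection rate approximates the probability that a repeated evaluation would reach the same binary conclusion. This is a minimal illustration under stated assumptions, not the paper's implementation; the synthetic data, the choice of the Wilcoxon signed-rank test, and all parameter values (alpha, number of bootstrap replicates) are hypothetical.

```python
# Minimal sketch of a bootstrap estimate of reproducibility probability (observed power).
# Assumptions (hypothetical, for illustration only): `deltas` holds per-query score
# differences between two engines (e.g., precision@10 of A minus B), the paired test is
# the Wilcoxon signed-rank test, and the significance level is 0.05.
import numpy as np
from scipy.stats import wilcoxon


def bootstrap_reproducibility(deltas, sample_size=None, n_boot=1000, alpha=0.05, seed=0):
    """Return the fraction of bootstrap resamples in which the paired test rejects H0.

    deltas      : per-query score differences between two systems
    sample_size : queries drawn per resample (defaults to len(deltas)); varying it lets
                  one ask how many queries a repeated evaluation would need
    """
    rng = np.random.default_rng(seed)
    deltas = np.asarray(deltas, dtype=float)
    n = sample_size or len(deltas)
    rejections = 0
    for _ in range(n_boot):
        sample = rng.choice(deltas, size=n, replace=True)
        if np.allclose(sample, 0):  # test is undefined when every difference is zero
            continue
        _, p_value = wilcoxon(sample)
        if p_value < alpha:
            rejections += 1
    return rejections / n_boot


# Usage sketch: scan candidate query sample sizes for the smallest one whose estimated
# reproducibility probability reaches a target (say 0.95), using synthetic differences.
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    deltas = rng.normal(loc=0.02, scale=0.1, size=200)  # synthetic per-query differences
    for m in (50, 100, 200, 400):
        print(m, bootstrap_reproducibility(deltas, sample_size=m))
```

Under this reading, the required query sample size is the smallest resample size whose estimated reproducibility probability exceeds the chosen target; pseudo-relevance judgments from web taxonomies would simply supply additional per-query scores to the same procedure.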
