Using titles and category names from editor-driven taxonomies for automatic evaluation

Evaluation of IR systems has always been difficult because of the need for manually assessed relevance judgments. The advent of large editor-driven taxonomies on the web opens the door to a new evaluation approach. We use the ODP (Open Directory Project) taxonomy to find sets of pseudo-relevant documents via one of two assumptions: 1) taxonomy entries are relevant to a given query if their editor-entered titles exactly match the query, or 2) all entries in a leaf-level taxonomy category are relevant to a given query if the category title exactly matches the query. We compare and contrast these two methodologies by evaluating six web search engines on a sample from an America Online log of ten million web queries, using mean reciprocal rank (MRR) measures for the first method and precision-based measures for the second. We show that this technique is stable with respect to the query set selected and correlated with a reasonably large manual evaluation.
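The two evaluation modes described above can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the query sets, URL normalization, and tie-breaking details are assumed, and the small example data is invented. Once a pseudo-relevant set per query has been built from directory titles (mode 1) or leaf-category contents (mode 2), scoring a search engine reduces to MRR or precision@k over its ranked results.

```python
# Illustrative sketch (assumed details, not the paper's code): score a
# search engine's rankings against pseudo-relevant sets mined from a
# directory taxonomy such as ODP.

def mrr(rankings, pseudo_relevant):
    """Mean reciprocal rank: for each query, 1/rank of the first ranked
    URL found in that query's pseudo-relevant set (0 if none appears).
    Suits mode 1, where an exact title match yields a small target set."""
    total = 0.0
    for query, ranked_urls in rankings.items():
        relevant = pseudo_relevant.get(query, set())
        rr = 0.0
        for rank, url in enumerate(ranked_urls, start=1):
            if url in relevant:
                rr = 1.0 / rank
                break
        total += rr
    return total / len(rankings)

def precision_at_k(rankings, pseudo_relevant, k=10):
    """Mean precision@k: fraction of the top k results that fall in the
    pseudo-relevant set. Suits mode 2, where every entry of a matching
    leaf category counts as relevant."""
    total = 0.0
    for query, ranked_urls in rankings.items():
        relevant = pseudo_relevant.get(query, set())
        hits = sum(1 for url in ranked_urls[:k] if url in relevant)
        total += hits / k
    return total / len(rankings)
```

A usage example: with rankings `{"q1": ["a", "b", "c"], "q2": ["x", "y"]}` and pseudo-relevant sets `{"q1": {"b"}, "q2": {"x"}}`, MRR is (1/2 + 1)/2 = 0.75 and precision@2 is (1/2 + 1/2)/2 = 0.5.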
