A case study in web search using TREC algorithms

Web search engines rank potentially relevant pages/sites for a user query. Ranking documents for user queries has also been at the heart of the Text REtrieval Conference (TREC), under the label of ad-hoc retrieval. The TREC community has developed document ranking algorithms that are known to be the best for searching the document collections used in TREC, which consist mainly of newswire text. The web search community, however, has developed its own methods for ranking web pages/sites, many of which exploit the link structure of the web and are quite different from the algorithms developed at TREC. This study evaluates the performance of a state-of-the-art keyword-based document ranking algorithm from TREC on a popular web search task: finding the web page/site of an entity, e.g., a company, university, organization, or individual. This form of querying is quite prevalent on the web. The results from the TREC algorithm are compared to those of four commercial web search engines. The results show that for finding the web page/site of an entity, commercial web search engines are notably better than a state-of-the-art TREC algorithm, in sharp contrast to the findings of several previous studies.
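As a concrete illustration of the kind of keyword-based ranking function the TREC ad-hoc track produced, the sketch below scores documents against a query with Okapi BM25 (Robertson et al.), one representative TREC-style formula. This is a minimal sketch, not the exact system evaluated in the study; the tokenization, the document-frequency statistics, and the parameter values k1 = 1.2 and b = 0.75 are illustrative assumptions.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freq, num_docs, avg_doc_len,
               k1=1.2, b=0.75):
    """Score one document against a keyword query with Okapi BM25.

    query_terms : list of query tokens
    doc_terms   : list of document tokens
    doc_freq    : mapping term -> number of documents containing it
    num_docs    : total number of documents in the collection
    avg_doc_len : mean document length (in tokens) over the collection
    """
    tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        if term not in tf:
            continue
        # Inverse document frequency: rarer terms contribute more.
        df = doc_freq.get(term, 0)
        idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
        # Term frequency, saturated by k1 and normalized by document
        # length relative to the collection average (controlled by b).
        f = tf[term]
        score += idf * (f * (k1 + 1)) / (
            f + k1 * (1 - b + b * doc_len / avg_doc_len))
    return score

# Toy usage: rank two hypothetical pages for the entity query
# "acme corporation" (all names and documents are made up).
docs = {
    "page_a": "acme corporation home welcome to acme".split(),
    "page_b": "a history of corporations in america".split(),
}
df = Counter()
for terms in docs.values():
    df.update(set(terms))
avg_len = sum(len(t) for t in docs.values()) / len(docs)
query = "acme corporation".split()
ranked = sorted(docs, key=lambda d: bm25_score(query, docs[d], df,
                                               len(docs), avg_len),
                reverse=True)
print(ranked)  # page_a, which matches both query terms, ranks first
```

A purely keyword-based scorer like this uses only the text of each page; the contrast drawn in the study is that commercial web engines additionally exploit evidence such as link structure, which appears to matter for entity home-page finding.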
