Retrieving Web Pages Using Content, Links, URLs and Anchors

For this year’s web track, we concentrated on the entry page finding task. For the content-only runs, in both the ad hoc task and the entry page finding task, we used an information retrieval system based on a simple unigram language model. In the ad hoc task we experimented with alternative approaches to smoothing. For the entry page task, we incorporated additional information into the model. The sources of information we used in addition to the document’s content are links, URLs and anchors. We found that almost every approach can improve the results of a content-only run. In the end, a very basic approach, using the depth of the path of the URL as a prior, yielded by far the largest improvement over the content-only results.
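As a rough illustration of how a URL-depth prior can be combined with a unigram language model, the sketch below scores a document by the log query likelihood plus a log prior. It is not the authors' implementation: the linear-interpolation smoothing, the weight `lam`, the geometric `decay` prior, and all function names are assumptions chosen for the example, since the abstract specifies neither the smoothing method nor the prior values.

```python
import math
from collections import Counter
from urllib.parse import urlparse


def url_depth(url):
    """Count non-empty path segments, ignoring a trailing index file."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    if segments and segments[-1].lower().startswith("index."):
        segments.pop()
    return len(segments)


def log_prior(url, decay=0.5):
    """Hypothetical prior: each extra path level multiplies P(D) by `decay`."""
    return url_depth(url) * math.log(decay)


def log_query_likelihood(query_terms, doc_terms, coll_counts, coll_size, lam=0.8):
    """Unigram query likelihood with linear interpolation smoothing (assumed)."""
    doc_counts = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        p_doc = doc_counts[term] / doc_len if doc_len else 0.0
        p_coll = coll_counts.get(term, 0) / coll_size
        p = (1 - lam) * p_doc + lam * p_coll
        if p == 0.0:
            # Term unseen in the whole collection: the document cannot match.
            return float("-inf")
        score += math.log(p)
    return score


def rank(query, docs, use_prior=True):
    """Rank {url: text} by log P(Q|D), optionally adding log P(D)."""
    tokenized = {url: text.lower().split() for url, text in docs.items()}
    coll_counts = Counter(t for terms in tokenized.values() for t in terms)
    coll_size = sum(coll_counts.values())
    query_terms = query.lower().split()
    scored = []
    for url, terms in tokenized.items():
        score = log_query_likelihood(query_terms, terms, coll_counts, coll_size)
        if use_prior:
            score += log_prior(url)
        scored.append((url, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

With identical content, the prior alone breaks the tie in favour of the shallower URL, which matches the intuition that entry pages tend to sit near the root of a site. In practice the prior would be estimated from training data rather than fixed by a decay constant.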
