Paper Information - Language Models for Searching in Web Corpora

Language Models for Searching in Web Corpora

We describe our participation in the TREC 2004 Web and Terabyte tracks. For the web track, we employ mixture language models based on document full-text, incoming anchor text, and document titles, with a range of web-centric priors. We provide a detailed analysis of the effect on relevance of document length, URL structure, and link topology. The resulting web-centric priors are applied to three types of topics (distillation, home page, and named page) and improve effectiveness for all topic types, as well as for the mixed query set. For the terabyte track, we experimented with building an index based only on the document titles, or only on the incoming anchor texts. Very selective indexing leads to a compact index that is effective in terms of early precision, catering to typical web searcher behavior.
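The abstract does not give the scoring formula, so the following is only a minimal sketch of how a mixture language model with a document prior is commonly written, not the authors' implementation. It assumes a linear interpolation of per-representation maximum-likelihood models (full text, anchor text, title) with a collection model, and a prior multiplied in at the end. The function name `mixture_score`, the weights, and the toy data are all hypothetical.

```python
import math
from collections import Counter


def mixture_score(query, reps, collection, weights, collection_weight, prior=1.0):
    """Return log( prior * prod_t [ sum_r w_r * P(t|d_r) + w_c * P(t|C) ] ).

    query: list of query terms
    reps: representation name -> list of tokens (e.g. text, anchor, title)
    collection: Counter of term frequencies over the whole collection
    weights: representation name -> mixture weight (assumed values)
    collection_weight: weight of the collection (background) model
    prior: document prior, e.g. derived from URL structure or link counts
    """
    coll_total = sum(collection.values())
    score = math.log(prior)
    for term in query:
        # Background model smooths terms missing from the document.
        p = collection_weight * collection[term] / coll_total
        for name, tokens in reps.items():
            if tokens:  # skip empty representations (e.g. no incoming anchors)
                p += weights[name] * tokens.count(term) / len(tokens)
        if p == 0.0:
            return float("-inf")  # term unseen everywhere
        score += math.log(p)
    return score


# Toy data, purely illustrative.
collection = Counter({"trec": 2, "web": 3, "track": 2, "home": 1, "page": 2})
weights = {"text": 0.3, "anchor": 0.3, "title": 0.2}
doc_a = {"text": ["trec", "web", "track"],
         "anchor": ["trec", "home", "page"],
         "title": ["trec"]}
doc_b = {"text": ["web", "page"], "anchor": [], "title": ["web"]}

score_a = mixture_score(["trec"], doc_a, collection, weights, 0.2)
score_b = mixture_score(["trec"], doc_b, collection, weights, 0.2)
score_a_prior = mixture_score(["trec"], doc_a, collection, weights, 0.2, prior=2.0)
```

In this sketch a document whose title and anchor text match the query outranks one that matches only through the background model, and the prior shifts the log score by a constant that does not depend on the query.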
