University of Glasgow at WebCLEF 2005: Experiments in per-field Normalisation and Language Specific Stemming

We participated in the WebCLEF 2005 monolingual task. In this task, a search system aims to retrieve relevant documents from a multilingual corpus of Web documents drawn from the Web sites of European governments. Both the documents and the queries are written in a wide range of European languages, so one challenge is to detect the language of each document and topic and to process it accordingly. We develop a language-specific technique that selects the appropriate stemmer and removes the appropriate stopwords from the queries. We represent documents using three fields: content, title, and the anchor text of incoming hyperlinks. To normalise the term frequencies and combine them across the three fields, we use per-field normalisation, a technique that extends the Divergence From Randomness (DFR) framework. We also use the length of the URL path of each Web document as evidence. The ranking combines language-specific stemming, where applied, with per-field normalisation. All experiments were run on our Terrier platform. Our techniques performed strongly: our submissions were the four top-performing runs overall and included the top-performing run without metadata in the monolingual task. The best run uses per-field normalisation only, without stemming.
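As a rough illustration of per-field normalisation, the sketch below normalises the term frequency in each field separately and then combines the results into a single weighted frequency. The abstract does not give the formula, so the sketch assumes the common DFR "Normalisation 2" form, tf · log2(1 + c · avg_len / len), applied to each field. The function name `per_field_tfn`, the field weights `w`, the hyper-parameters `c`, and all numeric values are hypothetical and are not taken from the paper.

```python
import math

FIELDS = ("content", "title", "anchor")


def per_field_tfn(tf, length, avg_length, c, w):
    """Return one normalised term frequency combined across fields.

    Assumed form (not stated in the abstract):
        tfn = sum over fields f of
              w[f] * tf[f] * log2(1 + c[f] * avg_length[f] / length[f])
    """
    tfn = 0.0
    for field in FIELDS:
        if length[field] == 0:
            # An empty field (e.g. no incoming anchor text) contributes nothing.
            continue
        norm = math.log2(1.0 + c[field] * avg_length[field] / length[field])
        tfn += w[field] * tf[field] * norm
    return tfn


# Hypothetical statistics for one query term in one document.
tf = {"content": 3, "title": 1, "anchor": 2}
length = {"content": 400, "title": 8, "anchor": 20}
avg_length = {"content": 500.0, "title": 6.0, "anchor": 15.0}
c = {"content": 1.0, "title": 10.0, "anchor": 5.0}   # illustrative values only
w = {"content": 1.0, "title": 2.0, "anchor": 3.0}    # illustrative values only

combined = per_field_tfn(tf, length, avg_length, c, w)
```

The combined frequency would then be passed to a DFR weighting model in place of the raw term frequency. Normalising each field separately lets short fields such as titles and anchor text be length-normalised against their own averages rather than against the much longer content field.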
