Terrier takes on the non-English Web

The aim of this work is to identify how standard Information Retrieval (IR) techniques can be adapted to Web retrieval for non-English queries. In particular, we address the challenge of stemming queries and documents in a multilingual setting. Experiments with a multilingual collection covering over 20 languages, more than 800 queries, and various stemming strategies in these languages reveal that applying no stemming yields satisfactory Web retrieval performance that is stable overall. Moreover, we show that language-specific stemming requires accurate identification of the language of each query.
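
To illustrate why language-specific stemming depends on query language identification, the following is a minimal sketch and not the system evaluated in this work. The toy suffix lists, the `STEMMERS` table, and the `process_query` function are all hypothetical names introduced for this example; a real setup would use proper stemmers for each language. The sketch falls back to no stemming whenever the query language is unknown or unsupported.

```python
from typing import Callable, Dict, List, Optional


def strip_suffixes(suffixes: List[str]) -> Callable[[str], str]:
    """Build a toy stemmer that removes the first matching suffix."""
    def stem(token: str) -> str:
        for suffix in suffixes:
            # Keep at least three characters so short words are not destroyed.
            if token.endswith(suffix) and len(token) - len(suffix) >= 3:
                return token[: -len(suffix)]
        return token
    return stem


# Hypothetical per-language stemmers (illustrative suffix lists only).
STEMMERS: Dict[str, Callable[[str], str]] = {
    "en": strip_suffixes(["ing", "es", "s"]),
    "es": strip_suffixes(["os", "as", "es", "s"]),
}


def process_query(query: str, language: Optional[str] = None) -> List[str]:
    """Tokenise a query and stem it only if a stemmer exists for `language`.

    With no language, or an unsupported one, tokens are left unstemmed.
    """
    tokens = query.lower().split()
    stemmer = STEMMERS.get(language) if language else None
    if stemmer is None:
        return tokens
    return [stemmer(token) for token in tokens]


if __name__ == "__main__":
    query = "Ministerios españoles"
    print(process_query(query, "es"))  # correct language
    print(process_query(query, "en"))  # misidentified language
    print(process_query(query))        # no stemming
```

In this sketch, a Spanish query stemmed with the English rules produces different terms from the same query stemmed with the Spanish rules, so it would no longer match an index whose documents were stemmed as Spanish. Leaving both queries and documents unstemmed avoids that dependency on language identification.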
