Language-Dependent and Language-Independent Approaches to Cross-Lingual Text Retrieval

We investigate the effectiveness of language-dependent approaches to document retrieval, such as stemming and decompounding, and contrast them with language-independent approaches, such as character n-gramming. To reap the benefits of both types of approach, we also consider the effectiveness of combining them. We focus on document retrieval in nine European languages: Dutch, English, Finnish, French, German, Italian, Russian, Spanish, and Swedish. We look at four information retrieval tasks: monolingual, bilingual, multilingual, and domain-specific retrieval. The experimental evidence is obtained using the 2003 test suite of the Cross-Language Evaluation Forum (CLEF).
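As a rough illustration of the language-independent side, character n-gramming replaces each word with its overlapping character substrings, so no stemmer or compound splitter is needed for a given language. The following is a minimal sketch only. The function name `char_ngrams`, the default n = 4, whitespace tokenization, lowercasing, and the choice to keep short words whole are all assumptions made for illustration, not details taken from the paper.

```python
def char_ngrams(text, n=4):
    """Split text into overlapping character n-grams, word by word.

    Illustrative assumptions: words are separated by whitespace,
    text is lowercased, and words of length <= n are kept whole.
    """
    grams = []
    for word in text.lower().split():
        if len(word) <= n:
            # Short words cannot be split further; index them as-is.
            grams.append(word)
        else:
            # Slide a window of width n across the word.
            grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams


if __name__ == "__main__":
    print(char_ngrams("retrieval"))
```

Because the same function applies unchanged to any of the nine languages, an index built this way needs no language-specific resources, which is the trade-off the abstract sets against stemming and decompounding.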
