Retrieval effectiveness of machine translated queries

This article describes and evaluates various information retrieval models used to search document collections written in English by submitting queries written in various other languages, either members of the Indo-European family (English, French, German, and Spanish) or from radically different language groups such as Chinese. The evaluation involves searching a rather large number of topics (around 300) and using two commercial machine translation systems to translate across the language barriers. In this study, mean average precision is used to measure variations in retrieval effectiveness when the query language differs from the document language. Although performance differences are rather large for certain language pairs, this does not mean that bilingual search methods are not commercially viable. Causes of the difficulties encountered when searching or during translation are analyzed, and the results are explained through concrete examples. © 2010 Wiley Periodicals, Inc.
