DOCUMENT TRANSLATION RETRIEVAL BASED ON STATISTICAL MACHINE TRANSLATION TECHNIQUES

We compare several strategies for applying statistical machine translation techniques to retrieve documents that are plausible translations of a given source document. Finding the translated version of a document is a useful task, for example when building a corpus of parallel texts to help create and evaluate new machine translation systems. In contrast to traditional cross-language information retrieval settings, here both the source and the target texts are long, so the procedure used to select which words or phrases are included in the query has a key effect on retrieval performance. In the statistical approach explored here, both the translation probability and the relevance of the terms are taken into account to build an effective query.
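
The idea of weighing translation probability against term relevance when building a query can be illustrated with a minimal sketch. The abstract does not give the scoring formula, so everything here is an assumption made for illustration: the function name `build_query`, the toy translation table, the document frequencies, and the choice of scoring each candidate target term as its translation probability times a smoothed tf-idf weight of the source term.

```python
import math
from collections import Counter


def build_query(source_tokens, translation_table, doc_freq, num_docs, max_terms=5):
    """Rank candidate target-language query terms.

    Assumed score (not taken from the paper):
        score(t) = sum over source terms s of p(t | s) * tf(s) * idf(s)
    """
    term_freq = Counter(source_tokens)
    scores = {}
    for src, count in term_freq.items():
        # Smoothed idf: non-negative as long as doc_freq <= num_docs.
        idf = math.log((num_docs + 1) / (doc_freq.get(src, 0) + 1))
        relevance = count * idf
        # Spread the relevance of the source term over its translations.
        for tgt, prob in translation_table.get(src, {}).items():
            scores[tgt] = scores.get(tgt, 0.0) + prob * relevance
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:max_terms]


# Hypothetical toy data: p(target | source) and source-side document frequencies.
translation_table = {
    "the": {"el": 0.5, "la": 0.5},
    "parliament": {"parlamento": 0.9},
    "approves": {"aprueba": 0.6, "adopta": 0.4},
    "report": {"informe": 0.8},
}
doc_freq = {"the": 1000, "parliament": 50, "approves": 200, "report": 100}
source_tokens = ["the", "parliament", "approves", "the", "report"]

query = build_query(source_tokens, translation_table, doc_freq, num_docs=1000, max_terms=3)
for term, score in query:
    print(f"{term}\t{score:.3f}")
```

In this sketch a frequent function word such as "the" contributes nothing to the query because its idf is zero, while a rare content word with a confident translation ranks highest. With long source documents, the `max_terms` cutoff is the point where such a selection procedure would influence retrieval.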
