Knowledge Expansion of a Statistical Machine Translation System using Morphological Resources

The translation capability of a Phrase-Based Statistical Machine Translation (PBSMT) system depends mostly on parallel data, and phrases that are not present in the training data are not correctly translated. This paper describes a method that efficiently expands the existing knowledge of a PBSMT system without adding more parallel data, using external morphological resources instead. A set of new phrase associations is added to the translation and reordering models; each corresponds to a morphological variation of the source phrase, the target phrase, or both phrases of an existing association. New associations are generated using a string similarity score based on morphosyntactic information. We tested our approach on En-Fr and Fr-En translation, and the results showed improved automatic scores (BLEU and Meteor) and fewer out-of-vocabulary (OOV) words. We believe that our knowledge expansion framework is generic and could be used to add different types of information to the model.

[...] are highly affected by the presence of OOV words. Conversely, the number of source phrases covered during translation is higher, but target sentences contain more incorrectly translated words. Adding more data is the most obvious solution, but this has well-known drawbacks: it greatly increases the size of the tables, which reduces translation speed, and parallel data are not available for all language pairs. With low-quality parallel data, it can even be harmful, because more data imply a larger number of unreliable or incorrect associations built during the training phase. In this paper, we address the problem of expanding the knowledge of an SMT system without adding parallel data, by instead extending the knowledge produced during the training phase.
The main idea consists of inserting artificial entries in the phrase and reordering models using external morphological resources; the goal is to provide more translation options to the system during the construction of the target sentence.
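
As a rough illustration of this idea, the sketch below generates artificial phrase-table entries by swapping one source word for a morphological variant of the same lemma. It is not the method evaluated in the paper: the toy lexicon, the `Entry` type, the position-wise tag comparison in `tag_similarity`, and the choice to discount the original probability by that score are all assumptions made for the example, and only source-side variation is shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """One phrase association; a hypothetical, simplified phrase-table row."""
    source: str
    target: str
    prob: float


# Hypothetical toy lexicon: surface form -> (lemma, morphosyntactic tag).
# The tag strings only imitate positional morphosyntactic codes.
LEXICON = {
    "house": ("house", "Ncns"),
    "houses": ("house", "Ncnp"),
    "maison": ("maison", "Ncfs"),
    "maisons": ("maison", "Ncfp"),
}


def variants(word):
    """Return (form, tag) pairs for other surface forms sharing the lemma of `word`."""
    if word not in LEXICON:
        return []
    lemma, _ = LEXICON[word]
    return [
        (form, tag)
        for form, (lem, tag) in LEXICON.items()
        if lem == lemma and form != word
    ]


def tag_similarity(a, b):
    """Fraction of tag positions on which two tag strings agree (assumed score)."""
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def expand(entry):
    """Create artificial entries by replacing one source word with a variant.

    The original probability is discounted by the tag similarity between the
    original word and its variant; this weighting is an assumption.
    """
    words = entry.source.split()
    new_entries = []
    for i, word in enumerate(words):
        if word not in LEXICON:
            continue
        _, tag = LEXICON[word]
        for form, variant_tag in variants(word):
            new_source = " ".join(words[:i] + [form] + words[i + 1:])
            score = tag_similarity(tag, variant_tag)
            new_entries.append(Entry(new_source, entry.target, entry.prob * score))
    return new_entries
```

Calling `expand(Entry("the house", "la maison", 0.8))` yields a single new entry whose source phrase is "the houses", with the probability scaled down because the two tags differ in one of four positions. A fuller version would apply the same substitution on the target side and create matching reordering-model entries.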