Supervised Morphology Generation Using Parallel Corpus

Translating from English, a morphologically poor language, into morphologically rich languages such as Persian poses many challenges. In this paper, we present an approach to rich morphology prediction using a parallel corpus. We focus on verb conjugation, the most important and problematic morphological phenomenon in Persian. We define a set of linguistic features drawn from both English and Persian linguistic information, and train our model on an English-Persian parallel corpus. We then predict six morphological features of the verb and generate the inflected verb form from its lemma. As a baseline in our experiments, we generate the verb form using the most common value of each feature. Our experiments show an improvement of almost 2.1% absolute BLEU on a test set containing 16K sentences.
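The most-common-value baseline mentioned above can be sketched as follows. This is a minimal illustration only, not the authors' implementation: the feature names (`tense`, `person`, `number`) and the toy training examples are hypothetical, since the abstract does not list the six features it predicts.

```python
from collections import Counter


def most_common_feature_values(training_examples):
    """Return the most frequent value of each feature across the examples."""
    counts = {}
    for features in training_examples:
        for name, value in features.items():
            counts.setdefault(name, Counter())[value] += 1
    return {name: counter.most_common(1)[0][0] for name, counter in counts.items()}


# Hypothetical, made-up verb annotations (feature names are assumptions).
train = [
    {"tense": "past", "person": "3", "number": "sg"},
    {"tense": "present", "person": "3", "number": "sg"},
    {"tense": "past", "person": "1", "number": "pl"},
]

# The baseline assigns these same values to every verb, ignoring context.
baseline = most_common_feature_values(train)
print(baseline)
```

A trained predictor would replace this constant assignment with values chosen per verb from the linguistic features of the sentence pair; the baseline serves only as the point of comparison.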
