Unsupervised Morphology Rivals Supervised Morphology for Arabic MT

If unsupervised morphological analyzers could approach the effectiveness of supervised ones, they would be a very attractive choice for improving MT performance on low-resource inflected languages. In this paper, we compare performance gains for state-of-the-art supervised vs. unsupervised morphological analyzers, using a state-of-the-art Arabic-to-English MT system. We apply maximum marginal decoding to the unsupervised analyzer, and show that this yields the best published segmentation accuracy for Arabic, while also making segmentation output more stable. Our approach gives an 18% relative BLEU gain for Levantine dialectal Arabic. Furthermore, it gives higher gains for Modern Standard Arabic (MSA), as measured on NIST MT-08, than does MADA (Habash and Rambow, 2005), a leading supervised MSA segmenter.
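The abstract names maximum marginal decoding without describing it. A minimal sketch of one common reading follows: approximate each word's marginal distribution over segmentations by counting across several independent samples or runs of the unsupervised analyzer, then keep the most frequent segmentation per word. The function name `max_marginal_decode`, the input layout, and the transliterated toy words are assumptions for illustration only and are not taken from the paper.

```python
from collections import Counter, defaultdict


def max_marginal_decode(sampled_runs):
    """Pick, for each word, the segmentation seen most often across runs.

    sampled_runs: list of dicts, one per independent sample/run of the
    analyzer, each mapping a word to its segmentation (sequence of morphemes).
    This input format is hypothetical.
    """
    votes = defaultdict(Counter)
    for run in sampled_runs:
        for word, segmentation in run.items():
            # Tuples are hashable, so they can serve as Counter keys.
            votes[word][tuple(segmentation)] += 1
    # The relative counts approximate per-word marginals; take the argmax.
    return {word: counts.most_common(1)[0][0] for word, counts in votes.items()}


# Toy data (hypothetical): three runs that disagree on some words.
runs = [
    {"wktAb": ("w", "ktAb"), "ktAbhm": ("ktAb", "hm")},
    {"wktAb": ("w", "ktAb"), "ktAbhm": ("ktA", "bhm")},
    {"wktAb": ("wk", "tAb"), "ktAbhm": ("ktAb", "hm")},
]
decoded = max_marginal_decode(runs)
```

Because each word's output is a majority vote over runs rather than the result of a single sample, run-to-run variation is averaged out, which is consistent with the stability claim in the abstract.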

[1]  Mark Johnson, et al. Improving nonparametric Bayesian inference: experiments on unsupervised word segmentation with adaptor grammars, 2009, NAACL.

[2]  Hoifung Poon, et al. Unsupervised Morphological Segmentation with Log-Linear Models, 2009, NAACL.

[3]  Mirella Lapata, et al. Proceedings of ACL-08: HLT, 2008.

[4]  Nizar Habash, et al. Four Techniques for Online Handling of Out-of-Vocabulary Words in Arabic-English Statistical Machine Translation, 2008, ACL.

[5]  Mei Yang, et al. Phrase-Based Backoff Models for Machine Translation of Highly Inflected Languages, 2006, EACL.

[6]  Mathias Creutz, et al. Unsupervised models for morpheme segmentation and morphology learning, 2007, TSLP.

[7]  Jinxi Xu, et al. A New String-to-Dependency Machine Translation Algorithm with a Target Dependency Language Model, 2008, ACL.

[8]  Hwee Tou Ng, et al. Translating from Morphologically Complex Languages: A Paraphrase-Based Approach, 2011, ACL.

[9]  Stergios B. Fotopoulos, et al. All of Nonparametric Statistics, 2007, Technometrics.

[10]  Kristina Toutanova, et al. Unsupervised Bilingual Morpheme Segmentation and Alignment with Context-rich Hidden Semi-Markov Models, 2011, ACL.

[11]  Regina Barzilay, et al. Modeling Syntactic Context Improves Morphological Segmentation, 2011, CoNLL.

[12]  Preslav Nakov, et al. A Hybrid Morpheme-Word Representation for Machine Translation of Morphologically Rich Languages, 2010, EMNLP.

[13]  James R. Glass, et al. Segmentation for English-to-Arabic Statistical Machine Translation, 2008, ACL.

[14]  Mathias Creutz, et al. Morphology-aware statistical machine translation based on morphs induced in an unsupervised manner, 2007, MTSUMMIT.

[15]  Larry Wasserman, et al. All of Nonparametric Statistics (Springer Texts in Statistics), 2006.

[16]  Philipp Koehn, et al. Statistical Significance Tests for Machine Translation Evaluation, 2004, EMNLP.

[17]  Anoop Sarkar, et al. Combining Morpheme-based Machine Translation with Post-processing Morpheme Prediction, 2011, ACL.

[18]  Philipp Koehn, et al. Factored Translation Models, 2007, EMNLP.

[19]  Nizar Habash, et al. Combination of Arabic Preprocessing Schemes for Statistical Machine Translation, 2006, ACL.

[20]  Coskun Mermer, et al. Unsupervised Search for the Optimal Segmentation for Statistical Machine Translation, 2010, ACL.

[21]  Murat Saraclar, et al. Unsupervised Turkish Morphological Segmentation for Statistical Machine Translation, 2010.

[22]  Nizar Habash, et al. Arabic Tokenization, Part-of-Speech Tagging and Morphological Disambiguation in One Fell Swoop, 2005, ACL.

[23]  Kemal Oflazer, et al. Two-level Description of Turkish Morphology, 1993, EACL.

[24]  John Makhoul, et al. Methods for integrating rule-based and statistical systems for Arabic to English machine translation, 2012, Machine Translation.

[25]  Chris Callison-Burch, et al. Machine Translation of Arabic Dialects, 2012, NAACL.

[26]  Lawrence R. Rabiner, et al. A tutorial on hidden Markov models and selected applications in speech recognition, 1989, Proc. IEEE.

[27]  Philipp Koehn, et al. Enriching Morphologically Poor Languages for Statistical Machine Translation, 2008, ACL.