On the impact of morphology in English to Spanish statistical MT

This paper presents a thorough study of the impact of morphological derivation on N-gram-based Statistical Machine Translation (SMT) models when translating from English into a morphologically rich language such as Spanish. For this purpose, we define a framework under the assumption that a certain degree of morphology-related information is not only ignored by current statistical translation models, but also harms their estimation because of the data sparseness it causes. Moreover, we describe how this information can be decoupled from the standard bilingual N-gram models and introduced separately by means of a well-defined and better-informed feature-based classification task. Results are presented for the European Parliament Plenary Sessions (EPPS) English-to-Spanish task, with oracle scores indicating the extent to which SMT models can benefit from simplifying Spanish morphological surface forms for each part-of-speech category. We show that the morphological richness of verb forms greatly weakens the standard statistical models, and we carry out a posterior morphology classification by defining a simple set of features and applying machine learning techniques. In addition, we propose a simple technique to deal with Spanish enclitic pronouns. Both techniques are evaluated empirically, and the final translation results show improvements over the baseline obtained solely by dealing with Spanish morphology. In principle, the study is also valid for translation from English into any other Romance language (Portuguese, Catalan, French, Galician, Italian, etc.). The proposed method can be applied to both monotonic and non-monotonic decoding scenarios, which reveals the interaction between word-order decoding and the proposed morphology simplification techniques. Overall, the results show a statistically significant improvement over baseline performance on this demanding task.
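
As an illustration of the simplification idea, the following minimal sketch replaces the surface forms of a chosen part-of-speech category with a lemma-based token and sets the removed morphology aside, so that a separate classifier could restore it after translation. The lexicon, the tag names, the `V[lemma]` token format, and the function `simplify` are hypothetical choices made for this example; they are not the representation or tools used in the paper.

```python
# Toy analyses: surface form -> (lemma, POS category, morphology tag).
# Hypothetical hand-written entries; a real system would use a tagger.
TOY_LEXICON = {
    "hablo": ("hablar", "V", "1sg.pres"),
    "hablas": ("hablar", "V", "2sg.pres"),
    "hablamos": ("hablar", "V", "1pl.pres"),
    "casa": ("casa", "N", "sg"),
    "casas": ("casa", "N", "pl"),
}


def simplify(tokens, pos_to_simplify=("V",)):
    """Replace tokens of the selected POS categories by 'POS[lemma]'.

    Returns the simplified token list and, aligned with it, the
    morphology tag removed from each token (None if untouched).
    """
    simplified, removed = [], []
    for tok in tokens:
        lemma, pos, morph = TOY_LEXICON.get(tok.lower(), (tok, None, None))
        if pos in pos_to_simplify:
            simplified.append(f"{pos}[{lemma}]")
            removed.append(morph)
        else:
            simplified.append(tok)
            removed.append(None)
    return simplified, removed


if __name__ == "__main__":
    corpus = ["hablo", "hablas", "hablamos", "casas"]
    simple, morph = simplify(corpus)
    # The three verb forms collapse into one vocabulary entry,
    # which is the sparseness reduction the abstract refers to.
    print(simple, morph)
    print(len(set(corpus)), "->", len(set(simple)))
```

In this toy setting the translation model would be estimated on the simplified tokens, and the tags kept in `removed` would serve as training labels for the later morphology classification step.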
