Improving Statistical Word Alignments with Morpho-syntactic Transformations

This paper presents a wide range of statistical word alignment experiments that incorporate morphosyntactic information. By transforming the parallel corpus according to POS tags, lemmas, or stems, we explore which kinds of linguistic information help reduce alignment error rates. Alignments are evaluated against a human word alignment reference, with the aim of an improved machine translation training scheme that eventually leads to better SMT performance. Experiments are carried out on a Spanish–English European Parliament Proceedings parallel corpus, in both a large and a small data track. As expected, the improvements from introducing morphosyntactic information are larger when data is scarce, but a significant improvement is also achieved in the large data task, which indicates that certain linguistic knowledge remains relevant even when large amounts of data are available.
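
To make the evaluation criterion concrete, the following is a minimal sketch of how alignment error rate (AER) is commonly computed against a manual reference that distinguishes sure (S) and possible (P) links: AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|), where A is the automatic alignment. The function name, the representation of links as (source index, target index) pairs, and the toy data are assumptions made for illustration; they are not taken from the paper.

```python
def alignment_error_rate(hypothesis, sure, possible):
    """Compute AER for one set of alignment links.

    Each argument is an iterable of (source_index, target_index) pairs.
    Sure links are treated as a subset of the possible links.
    """
    a = set(hypothesis)
    s = set(sure)
    p = set(possible) | s  # every sure link also counts as possible

    if not a and not s:
        # Nothing proposed and nothing required: no error by convention here.
        return 0.0

    return 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))


# Toy example (hypothetical data): two sure links, one extra possible link.
sure_links = {(0, 0), (1, 1)}
possible_links = {(2, 1)}
hypothesis_links = {(0, 0), (2, 1), (2, 2)}

aer = alignment_error_rate(hypothesis_links, sure_links, possible_links)
print(f"AER = {aer:.2f}")
```

In this toy case one proposed link matches a sure link, two match possible links (sure links included), and the sizes of A and S are 3 and 2, so the score is 1 - 3/5. Lower values indicate closer agreement with the human reference.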
