Improving Function Word Alignment with Frequency and Syntactic Information

In statistical word alignment for machine translation, function words are often aligned poorly because they lack clear correspondences across languages. This paper proposes a novel approach that improves word alignment by pruning function word alignments produced by an existing alignment model with high precision and recall. First, a language-independent algorithm recognizes function words from monolingual and bilingual frequency characteristics. Then a group of carefully defined syntactic structures, combined with content word alignments, is used to further prune function word alignments. Experimental results show that the proposed approach improves both word alignment quality and statistical machine translation performance on the Chinese-to-English, German-to-English and French-to-English language pairs.
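The abstract does not give the recognition or pruning rules, so the following is only a minimal sketch of the general idea and not the paper's algorithm. It assumes that function word candidates can be approximated by a relative-frequency threshold on a monolingual corpus, and that pruning simply drops every alignment link touching a candidate. The bilingual frequency characteristics and the syntactic structures mentioned above are not modeled, and the names `frequent_words`, `prune_links` and `min_rel_freq` are hypothetical.

```python
from collections import Counter


def frequent_words(sentences, min_rel_freq):
    """Return words whose relative corpus frequency is at least min_rel_freq.

    Assumption: high monolingual frequency is used as a stand-in for
    "function word"; the paper also uses bilingual frequency information.
    """
    counts = Counter(word for sentence in sentences for word in sentence)
    total = sum(counts.values())
    return {word for word, count in counts.items() if count / total >= min_rel_freq}


def prune_links(src, tgt, links, src_function, tgt_function):
    """Drop alignment links (i, j) that touch a candidate function word.

    Assumption: unconditional pruning; the paper instead decides per link
    using syntactic structures and content word alignments.
    """
    return [
        (i, j)
        for i, j in links
        if src[i] not in src_function and tgt[j] not in tgt_function
    ]


# Toy corpora: tokenized sentences, one list of tokens per sentence.
src_corpus = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "bird", "sang"]]
tgt_corpus = [["le", "chat", "dort"], ["le", "chien", "court"], ["le", "oiseau", "chante"]]

src_function = frequent_words(src_corpus, min_rel_freq=0.3)
tgt_function = frequent_words(tgt_corpus, min_rel_freq=0.3)

# Links from a hypothetical baseline aligner for the first sentence pair.
baseline_links = [(0, 0), (1, 1), (2, 2)]
pruned_links = prune_links(
    src_corpus[0], tgt_corpus[0], baseline_links, src_function, tgt_function
)
```

In this toy setting only the link between "the" and "le" is removed. A real system would need a threshold tuned per corpus, since frequent content words can also exceed it.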
