Improving Statistical Machine Translation Using Domain Bilingual Multiword Expressions

Multiword expressions (MWEs) have proven useful for many natural language processing tasks. However, how to use them to improve the performance of statistical machine translation (SMT) has not been well studied. This paper presents a simple yet effective strategy for extracting domain bilingual multiword expressions. In addition, we implement three methods for integrating bilingual MWEs into Moses, the state-of-the-art phrase-based machine translation system. Experiments show that bilingual MWEs can improve translation performance significantly.
