Given Bilingual Terminology in Statistical Machine Translation: MWE-Sensitive Word Alignment and Hierarchical Pitman-Yor Process-Based Translation Model Smoothing

This paper considers a scenario in which we are given almost perfect knowledge about the bilingual terminology of a test corpus in Statistical Machine Translation (SMT). When the given terminology is part of the training corpus, one natural strategy in SMT is to use the trained translation model while ignoring the given terminology. Two questions then arise. 1) Can a word aligner capture the given terminology? Even when the terminology appears in the training corpus, the resulting translation model often fails to include it. 2) Are the probabilities in the translation model calculated correctly? To answer these questions, we ran experiments introducing a Multi-Word Expression-sensitive (MWE-sensitive) word aligner and a hierarchical Pitman-Yor process-based translation model smoothing. Using the 200k-sentence JP--EN NTCIR corpus, our experimental results show that introducing both the MWE-sensitive word aligner and the new translation model smoothing yields an overall improvement of 1.35 BLEU points absolute and 6.0% relative over the baseline that uses neither.
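To make the smoothing component concrete, the sketch below shows the standard predictive rule of a single Pitman-Yor restaurant, which discounts observed counts and backs off to a base distribution; hierarchical variants chain such restaurants. This is an illustrative assumption on our part (the function name, data layout, and parameter values are hypothetical, not the paper's implementation):

```python
# Minimal sketch of Pitman-Yor predictive smoothing (single restaurant).
# With discount d, strength theta, and base distribution P0:
#   P(w) = (c_w - d * t_w) / (theta + c) + (theta + d * t) / (theta + c) * P0(w)
# where c_w / t_w are customer / table counts for w, and c / t their totals.

def pitman_yor_prob(word, counts, tables, theta, d, base_prob):
    """Predictive probability of `word` under one Pitman-Yor restaurant.

    counts: dict word -> number of customers (observed counts)
    tables: dict word -> number of tables serving that word
    base_prob: callable giving the backoff distribution P0(w)
    """
    c = sum(counts.values())   # total customers
    t = sum(tables.values())   # total tables
    c_w = counts.get(word, 0)
    t_w = tables.get(word, 0)
    # Discounted count mass plus backoff mass, normalized by (theta + c).
    return (max(c_w - d * t_w, 0.0) + (theta + d * t) * base_prob(word)) / (theta + c)
```

Note that unseen words still receive probability through the backoff term, which is what lets the smoothed phrase table assign non-zero mass to terminology pairs that the aligner missed.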
