A Multi-Domain Translation Model Framework for Statistical Machine Translation

While domain adaptation techniques for SMT have proven to be effective at improving translation quality, their practicality for a multi-domain environment is often limited because of the computational and human costs of developing and maintaining multiple systems adapted to different domains. We present an architecture that delays the computation of translation model features until decoding, allowing for the application of mixture-modeling techniques at decoding time. We also describe a method for unsupervised adaptation with development and test data from multiple domains. Experimental results on two language pairs demonstrate the effectiveness of both our translation model architecture and automatic clustering, with gains of up to 1 BLEU over unadapted systems and single-domain adaptation.

[1]  Maria Leonor Pacheco,et al.  of the Association for Computational Linguistics: , 2001 .

[2]  Francisco Casacuberta,et al.  Online Learning for Interactive Statistical Machine Translation , 2010, NAACL.

[3]  Philipp Koehn,et al.  Moses: Open Source Toolkit for Statistical Machine Translation , 2007, ACL.

[4]  Holger Schwenk,et al.  A General Framework to Weight Heterogeneous Parallel Data for Model Adaptation in Statistical MT , 2012, AMTA.

[5]  Spyridon Matsoukas,et al.  Discriminative Corpus Weight Estimation for Machine Translation , 2009, EMNLP.

[6]  Andreas Eisele,et al.  DGT-TM: A freely available Translation Memory in 22 languages , 2012, LREC.

[7]  Rico Sennrich,et al.  MT-based Sentence Alignment for OCR-generated Parallel Texts , 2010, AMTA.

[8]  Rico Sennrich,et al.  Perplexity Minimization for Translation Model Domain Adaptation in Statistical Machine Translation , 2012, EACL.

[9]  Jörg Tiedemann,et al.  News from OPUS — A collection of multilingual parallel corpora with tools and interfaces , 2009 .

[10]  Daniel Marcu,et al.  Statistical Phrase-Based Translation , 2003, NAACL.

[11]  Alex Waibel,et al.  Domain Adaptation in Statistical Machine Translation using Factored Translation Models , 2010, EAMT.

[12]  Roland Kuhn,et al.  Discriminative Instance Weighting for Domain Adaptation in Statistical Machine Translation , 2010, EMNLP.

[13]  Rico Sennrich,et al.  Mixture-Modeling with Unsupervised Clusters for Domain Adaptation in Statistical Machine Translation , 2012, EAMT.

[14]  Andreas Stolcke,et al.  SRILM - an extensible language modeling toolkit , 2002, INTERSPEECH.

[15]  Rico Sennrich,et al.  Iterative, MT-based Sentence Alignment of Parallel Texts , 2011, NODALIDA.

[16]  Anoop Sarkar,et al.  Mixing Multiple Translation Models in Statistical Machine Translation , 2012, ACL.

[17]  Salim Roukos,et al.  Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.

[18]  Hermann Ney,et al.  A Systematic Comparison of Various Statistical Alignment Models , 2003, CL.

[19]  Holger Schwenk,et al.  Translation Model Adaptation by Resampling , 2010, WMT@ACL.

[20]  Ondrej Bojar,et al.  CzEng 0.9: Large Parallel Treebank with Rich Annotation , 2009, Prague Bull. Math. Linguistics.

[21]  Alon Lavie,et al.  One System, Many Domains: Open-Domain Statistical Machine Translation via Feature Augmentation , 2012, AMTA.

[22]  Roland Kuhn,et al.  Mixture-Model Adaptation for SMT , 2007, WMT@ACL.

[23]  Tomaz Erjavec,et al.  The JRC-Acquis: A Multilingual Aligned Parallel Corpus with 20+ Languages , 2006, LREC.

[24]  Hal Daumé,et al.  Domain Adaptation for Machine Translation by Mining Unseen Words , 2011, ACL.

[25]  Andy Way,et al.  Combining Multi-Domain Statistical Machine Translation Models using Automatic Classifiers , 2010, AMTA.

[26]  Philipp Koehn,et al.  Experiments in Domain Adaptation for Statistical Machine Translation , 2007, WMT@ACL.

[27]  Arianna Bisazza,et al.  Fill-up versus interpolation methods for phrase-based SMT adaptation , 2011, IWSLT.

[28]  Eiichiro Sumita,et al.  Bilingual Cluster Based Models for Statistical Machine Translation , 2007, EMNLP.

[29]  Philipp Koehn,et al.  Findings of the 2012 Workshop on Statistical Machine Translation , 2012, WMT@NAACL-HLT.

[30]  Preslav Nakov,et al.  Improving English-Spanish Statistical Machine Translation: Experiments in Domain Adaptation, Sentence Paraphrasing, Tokenization, and Recasing , 2008, WMT@ACL.

[31]  Alon Lavie,et al.  Better Hypothesis Testing for Statistical Machine Translation: Controlling for Optimizer Instability , 2011, ACL.