Edinburgh’s Machine Translation Systems for European Language Pairs

We validated various novel and recently proposed methods for statistical machine translation on 10 language pairs, using large data resources. We saw gains from optimizing parameters, training with sparse features, the operation sequence model, and domain adaptation techniques. We also report on using a huge language model trained on 126 billion tokens.

The annual machine translation evaluation campaign for European languages organized around the ACL Workshop on Statistical Machine Translation offers the opportunity to test recent advances in machine translation under large-data conditions across several diverse language pairs. Building on our own developments and external contributions to the Moses open source toolkit, we carried out extensive experiments that, by early indications, led to a strong showing in the evaluation campaign. We would especially like to stress two contributions: the use of the new operation sequence model (Section 3) within Moses, and, in a separate unconstrained track submission, the use of a huge language model trained on 126 billion tokens with a new training tool (Section 4).

1 Initial System Development

We start with systems (Haddow and Koehn, 2012) that we developed for the 2012 Workshop on Statistical Machine Translation (Callison-Burch et al., 2012).
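Several of the domain adaptation techniques evaluated here rely on data selection by cross-entropy difference, as in modified Moore-Lewis filtering (Axelrod et al., 2011): each candidate sentence is scored by its cross-entropy under an in-domain language model minus its cross-entropy under a general-domain model, and only the lowest-scoring sentences are kept. A minimal sketch, assuming toy add-one-smoothed unigram models purely for illustration (the "modified" variant of Axelrod et al. sums this difference over both source and target sides, and real systems use full n-gram models):

```python
import math
from collections import Counter

def train_unigram(corpus):
    """Add-one-smoothed unigram LM from a list of tokenized sentences."""
    counts = Counter(tok for sent in corpus for tok in sent)
    total = sum(counts.values())
    vocab = len(counts) + 1  # reserve one slot for unseen tokens
    return lambda tok: (counts[tok] + 1) / (total + vocab)

def cross_entropy(lm, sent):
    """Per-token cross-entropy (bits) of a sentence under the LM."""
    return -sum(math.log2(lm(tok)) for tok in sent) / max(len(sent), 1)

def moore_lewis_scores(in_domain, general, candidates):
    """Score candidates by H_in(s) - H_gen(s); lower means more in-domain."""
    lm_in = train_unigram(in_domain)
    lm_gen = train_unigram(general)
    return [(cross_entropy(lm_in, s) - cross_entropy(lm_gen, s), s)
            for s in candidates]
```

Subsampling then amounts to sorting the candidate pool by these scores and keeping only the lowest-scoring fraction.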
The notable features of these systems are:

• Moses phrase-based models with mostly default settings
• training on all available parallel data, including the large UN parallel data, the French–English 10⁹ parallel data and the LDC Gigaword data
• a very large tuning set consisting of the test sets from 2008-2010, with a total of 7,567 sentences per language
• German–English with syntactic pre-reordering (Collins et al., 2005), compound splitting (Koehn and Knight, 2003) and use of a factored representation for a POS target sequence model (Koehn and Hoang, 2007)
• English–German with a morphological target sequence model

Note that while our final 2012 systems included subsampling of training data with modified Moore-Lewis filtering (Axelrod et al., 2011), we did not use such filtering at the starting point of our development. We will report on such filtering in Section 2. Moreover, our system development initially used the WMT 2012 data condition, since it took place throughout 2012, and we switched to the WMT 2013 training data at a later stage. In this section, we report cased BLEU scores (Papineni et al., 2001) on newstest2011.

1.1 Factored Backoff (German–English)

We have consistently used factored models in past WMT systems for the German–English language pair to include POS and morphological target sequence models. However, we did not use the factored decomposition of translation options into multiple mapping steps, since this usually leads to much slower systems with often worse results. A good place for factored decomposition, however, is the handling of rare and unknown source words that have more frequent morphological variants (Koehn and Haddow, 2012a). Here, we used factored backoff only for unknown words, yielding a gain of +0.12 BLEU for German–English.

1.2 Tuning with k-best MIRA

In preparation for training with sparse features, we moved away from MERT, which is known to fall

[1]  Hermann Ney, et al. Improved Backing-Off for M-gram Language Modeling, 1995, International Conference on Acoustics, Speech, and Signal Processing.

[2]  Stanley F. Chen, et al. An Empirical Study of Smoothing Techniques for Language Modeling, 1996, ACL.

[3]  Salim Roukos, et al. Bleu: a Method for Automatic Evaluation of Machine Translation, 2002, ACL.

[4]  Philipp Koehn, et al. Empirical Methods for Compound Splitting, 2003, EACL.

[5]  Philipp Koehn, et al. Clause Restructuring for Statistical Machine Translation, 2005, ACL.

[6]  Philipp Koehn, et al. Factored Translation Models, 2007, EMNLP.

[7]  Thorsten Brants, et al. Large Language Models in Machine Translation, 2007, EMNLP.

[8]  Joel D. Martin, et al. Improving Translation Quality by Discarding Most of the Phrasetable, 2007, EMNLP.

[9]  David Chiang, et al. 11,001 New Features for Statistical Machine Translation, 2009, NAACL.

[10]  William D. Lewis, et al. Intelligent Selection of Language Model Training Data, 2010, ACL.

[11]  Roland Kuhn, et al. Discriminative Instance Weighting for Domain Adaptation in Statistical Machine Translation, 2010, EMNLP.

[12]  Nadir Durrani, et al. A Joint Sequence Translation Model with Integrated Reordering, 2011, ACL.

[13]  Charles L. A. Clarke, et al. Efficient and Effective Spam Filtering and Re-ranking for Large Web Datasets, 2010, Information Retrieval.

[14]  Jianfeng Gao, et al. Domain Adaptation via Pseudo In-Domain Data Selection, 2011, EMNLP.

[15]  Kenneth Heafield. KenLM: Faster and Smaller Language Model Queries, 2011, WMT@EMNLP.

[16]  Philipp Koehn, et al. Towards Effective Use of Training Data in Statistical Machine Translation, 2012, WMT@NAACL-HLT.


[18]  Philipp Koehn, et al. Analysing the Effect of Out-of-Domain Data on SMT Systems, 2012, WMT@NAACL-HLT.

[19]  Hermann Ney, et al. A Simple and Effective Weighted Phrase Extraction for Machine Translation Adaptation, 2012, IWSLT.

[20]  Rico Sennrich. Perplexity Minimization for Translation Model Domain Adaptation in Statistical Machine Translation, 2012, EACL.

[21]  Philipp Koehn, et al. Findings of the 2012 Workshop on Statistical Machine Translation, 2012, WMT@NAACL-HLT.

[22]  Philipp Koehn, et al. Interpolated Backoff for Factored Translation Models, 2012, AMTA.

[23]  George F. Foster, et al. Batch Tuning Strategies for Statistical Machine Translation, 2012, NAACL.

[24]  Barry Haddow, et al. Applying Pairwise Ranked Optimisation to Improve the Interpolation of Translation Models, 2013, NAACL.

[25]  Nadir Durrani, et al. Can Markov Models Over Minimal Translation Units Help Phrase-Based SMT?, 2013, ACL.

[26]  Philipp Koehn, et al. Scalable Modified Kneser-Ney Language Model Estimation, 2013, ACL.

[27]  Nadir Durrani, et al. Model With Minimal Translation Units, But Decode With Phrases, 2013, HLT-NAACL.