Edinburgh’s Phrase-based Machine Translation Systems for WMT-14

This paper describes the University of Edinburgh’s (UEDIN) phrase-based submissions to the translation and medical translation shared tasks of the 2014 Workshop on Statistical Machine Translation (WMT). We participated in all language pairs. We have improved upon our 2013 system by i) using generalized representations, specifically automatic word clusters for translations out of English, ii) using unsupervised character-based models to translate unknown words in the Russian-English and Hindi-English pairs, iii) synthesizing Hindi data from closely related Urdu data, and iv) building huge language models on the Common Crawl corpus.
