Parser Self-Training for Syntax-Based Machine Translation

In syntax-based machine translation, parsing accuracy is known to greatly affect translation accuracy. Self-training, which uses a parser's own output as additional training data, is one method for improving parser accuracy. However, because parsing errors introduce noise into the training data, automatically generated parse trees do not always contribute to improving accuracy. In this paper, we propose a method for selecting self-training data: we perform syntax-based machine translation using a variety of parse trees, use automatic evaluation metrics to select which translation is better, and use that translation's parse tree for parser self-training. This allows us to automatically choose the trees that contribute to improving translation accuracy, increasing the effectiveness of self-training. In experiments, we found that our self-trained parsers significantly improve a state-of-the-art syntax-based machine translation system on two language pairs.
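The selection procedure can be summarized as a simple loop over candidate parses. The sketch below is a minimal illustration under stated assumptions, not the authors' exact implementation: the callables `parse_nbest`, `translate_with_tree`, and `sentence_score` are hypothetical stand-ins for an n-best parser, a tree-to-string MT decoder, and a sentence-level evaluation metric such as BLEU+1.

```python
# Minimal sketch of the self-training data selection loop (assumptions noted above).
from typing import Callable, Iterable, List, Tuple


def select_self_training_trees(
    bitext: Iterable[Tuple[str, str]],             # (source sentence, reference translation)
    parse_nbest: Callable[[str], List[object]],    # source -> candidate parse trees
    translate_with_tree: Callable[[object], str],  # parse tree -> translation hypothesis
    sentence_score: Callable[[str, str], float],   # (hypothesis, reference) -> metric score
) -> List[object]:
    """For each sentence, keep the parse tree whose translation scores best."""
    selected = []
    for source, reference in bitext:
        best_tree, best_score = None, float("-inf")
        for tree in parse_nbest(source):
            # Translate with this particular parse and score the result.
            hypothesis = translate_with_tree(tree)
            score = sentence_score(hypothesis, reference)
            if score > best_score:
                best_tree, best_score = tree, score
        if best_tree is not None:
            selected.append(best_tree)
    # The selected trees become additional training data for the parser.
    return selected
```

Because selection is driven by the downstream translation metric rather than by parser confidence, the retained trees are those most useful for translation, even if they are not the parser's own first-best analyses.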
