A tree does not make a well-formed sentence: Improving syntactic string-to-tree statistical machine translation with more linguistic knowledge

Abstract: Synchronous context-free grammars (SCFGs) can be learned from parallel texts that are annotated with target-side syntax, and can produce translations by building target-side syntactic trees from source strings. Ideally, producing syntactic trees would entail that the translation is grammatically well-formed, but in reality, this is often not the case. Focusing on translation into German, we discuss various ways in which string-to-tree translation models over- or undergeneralise. We show how these problems can be addressed by choosing a suitable parser and modifying its output, by introducing linguistic constraints that enforce morphological agreement and constrain subcategorisation, and by modelling the productive generation of German compounds.
