Soft Syntactic Constraints for Hierarchical Phrase-Based Translation Using Latent Syntactic Distributions

In this paper, we present a novel approach to enhancing hierarchical phrase-based machine translation systems with linguistically motivated syntactic features. Rather than directly using treebank categories as in previous studies, we automatically learn a set of linguistically guided latent syntactic categories from a source-side-parsed, word-aligned parallel corpus, based on the hierarchical structure among phrase pairs as well as the syntactic structure of the source side. In our model, each X nonterminal in an SCFG rule is decorated with a real-valued feature vector computed from its distribution over latent syntactic categories. At decoding time, these feature vectors are used to measure the similarity between the syntactic analysis of the source side and the syntax of the SCFG rules applied to derive translations. Our approach maintains the advantages of hierarchical phrase-based translation systems while naturally incorporating soft syntactic constraints.
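As a rough illustration of the idea of decorating an X nonterminal with a vector and comparing it against the source-side analysis, the following minimal Python sketch may help. It is not the paper's method: the abstract does not state which similarity function is used, so cosine similarity is an assumption here, and the category names (`C1`, `C2`, `C3`), the counts, and the helper names `to_distribution` and `cosine_similarity` are all hypothetical.

```python
import math


def to_distribution(counts):
    """Normalize raw category counts into a probability distribution."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {cat: n / total for cat, n in counts.items()}


def cosine_similarity(p, q):
    """Cosine similarity between two sparse vectors stored as dicts.

    Assumption: the abstract does not name the similarity measure;
    cosine is used here only as a plausible stand-in.
    """
    dot = sum(value * q.get(cat, 0.0) for cat, value in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_p * norm_q)


# Hypothetical counts of latent categories observed for one X nonterminal
# of an SCFG rule during rule extraction (illustrative numbers only).
rule_x_vector = to_distribution({"C1": 3, "C2": 1})

# Hypothetical latent-category distribution of the source span that the
# decoder wants to cover with this nonterminal.
span_vector = {"C1": 0.6, "C2": 0.1, "C3": 0.3}

# A soft constraint: the score is a real-valued feature, not a hard filter,
# so a rule with a low score can still be applied.
soft_syntax_feature = cosine_similarity(rule_x_vector, span_vector)
print(f"soft syntactic similarity: {soft_syntax_feature:.3f}")
```

Because the score enters the model as a feature value and never rejects a rule outright, a mismatch between the rule and the parse only lowers the score of a derivation, which is what makes the constraint soft.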
