Using percolated dependencies for phrase extraction in SMT

Statistical Machine Translation (SMT) systems rely heavily on the quality of the phrase pairs induced from large amounts of training data. Besides the widely used heuristic method of learning n-gram phrase translations from word alignments, there are numerous other methods for extracting phrase pairs. One class of approaches uses the translation information encoded in parallel treebanks. Work to date has demonstrated the usefulness of translation models induced from both constituency structure trees and dependency structure trees. Both kinds of syntactic annotation rely on the existence of natural language parsers for the source and target languages. We depart from the norm by obtaining dependency parses directly from constituency structures using head percolation tables. The paper investigates the use of aligned chunks induced from percolated dependencies in French–English SMT and contrasts them with the phrase pairs extracted by the aforementioned methods. We observe that adding phrase pairs from any other method improves translation performance over the baseline n-gram-based system, that percolated dependencies are a good substitute for parsed dependencies, and that supplementing with our novel head percolation-induced chunks shows a general trend toward improving all system types across two data sets, with up to a 5.26% relative increase in BLEU.
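To illustrate how head percolation can turn a constituency tree into dependencies, the following is a minimal sketch and not the paper's implementation. The head table `HEAD_RULES`, the tree encoding, and the function names are all assumptions made for illustration; real tables of this kind are far larger and language-specific. For each phrase, the table picks one child as the head, the head word of that child percolates up, and every other child's head word becomes a dependent of it.

```python
# Toy head percolation table (hypothetical, for illustration only):
# phrase label -> (search direction, child labels in priority order).
HEAD_RULES = {
    "S": ("right", ["VP", "S"]),
    "VP": ("left", ["VBD", "VBZ", "VB", "VP"]),
    "NP": ("right", ["NN", "NNS", "NNP", "NP"]),
    "PP": ("left", ["IN"]),
}


def find_head_child(label, children):
    """Return the position of the head child according to HEAD_RULES."""
    direction, priorities = HEAD_RULES.get(label, ("left", []))
    order = list(range(len(children)))
    if direction == "right":
        order.reverse()
    for wanted in priorities:
        for i in order:
            if children[i][0] == wanted:
                return i
    # Fallback: first child in the search direction.
    return order[0]


def percolate(tree, words, deps):
    """Return the word index of the tree's head; record (dependent, head) arcs."""
    label, body = tree
    if isinstance(body, str):  # leaf: (POS tag, word)
        words.append(body)
        return len(words) - 1
    child_heads = [percolate(child, words, deps) for child in body]
    h = find_head_child(label, body)
    for i, child_head in enumerate(child_heads):
        if i != h:
            deps.append((child_head, child_heads[h]))
    return child_heads[h]


def to_dependencies(tree):
    """Convert a constituency tree into (words, root index, sorted arcs)."""
    words, deps = [], []
    root = percolate(tree, words, deps)
    return words, root, sorted(deps)


# Example tree for "the cat sat on the mat".
EXAMPLE = ("S", [
    ("NP", [("DT", "the"), ("NN", "cat")]),
    ("VP", [
        ("VBD", "sat"),
        ("PP", [("IN", "on"), ("NP", [("DT", "the"), ("NN", "mat")])]),
    ]),
])

words, root, deps = to_dependencies(EXAMPLE)
for dependent, head in deps:
    print(f"{words[dependent]} <- {words[head]}")
print("root:", words[root])
```

Under these toy rules the verb "sat" becomes the root, with "cat" and "on" attached to it. No dependency parser is involved at any point; only the constituency tree and the head table are needed.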
