Paper information - Syntax-Based Statistical Machine Translation: A review

Syntax-Based Statistical Machine Translation: A review

Ever since the inception of computers and the first introduction of artificial intelligence, machine translation has been a target goal, or better said, a dream that was at some point deemed impossible (ALPAC 1966). The problem that machine translation aims to solve is simple to state: given a document or sentence in a source language, produce its equivalent in the target language. The problem is complicated by the inherent ambiguity of languages: the same word can have different meanings depending on the context, and idioms and many other computational factors add to the difficulty. Moreover, extra domain knowledge is needed for high-quality output. Early techniques for solving this problem were human-intensive, relying on parsing, transfer rules, and generation with the help of an interlingua (Hutchins 1995). This approach, while performing well in restricted domains, is not scalable and is not suitable for languages for which we do not have a syntactic theory or parser. In the last decade, statistical techniques using the noisy channel model dominated the field and outperformed classical ones (Brown et al. 1993). However, one problem with statistical methods is that they do not employ enough linguistic theory to produce grammatically coherent output (Och et al. 2003). This is because these methods incorporate little or no explicit syntactic theory and capture elements of syntax only implicitly, via an n-gram language model in the noisy channel framework, which cannot model long-range dependencies. The goal of syntax-based machine translation techniques is to incorporate an explicit representation of syntax into statistical systems to get the best of both worlds: high-quality output without intensive human effort. In this report we give an overview of various approaches to syntax-aware statistical machine translation developed, or proposed, in the last two decades.
In our survey, we stress the tension between the expressivity of a model and the complexity of its associated training and decoding procedures. The rest of this report is organized as follows. Section 2 gives a brief overview of the basic statistical machine translation model that serves as the basis of the subsequent discussion, and motivates the need for deploying syntax in the translation pipeline. In Section 3, we discuss the various grammar formalisms that have been proposed to model parallel texts. In Section 4, we describe how these theoretical ideas have been used to augment the basic models of Section 2, detail how the resulting models are trained from data, and assess their complexity against the extra accuracy gained. Finally, we conclude in Section 5.

Amr Ahmed
