Paper information - Improved word alignments for statistical machine translation

Improved word alignments for statistical machine translation

All state-of-the-art statistical machine translation systems, and many example-based machine translation systems, depend on an annotation of word-level translational correspondence between sets of parallel sentences. Such an annotation of two parallel sentences is called a "word alignment". The largest set of manually annotated word alignments currently available to the research community for any pair of languages covers only thousands of parallel sentences, even though several orders of magnitude more parallel sentences are available. For instance, for the task of translating Chinese news articles to English, there are currently on the order of 10 million parallel sentences. This is too many for manual alignment, so they must be word aligned automatically.

Unsupervised word alignment systems generate poor-quality alignments, often using statistical word alignment models developed over 10 years ago, yet most recent research into improving word alignments has not led to improved translation. There are several reasons for this: (1) There is no good metric that can be used to automatically measure word alignment quality for the translation task. (2) Statistical word alignment models are based on incorrect assumptions about the structure of the problem. (3) It is difficult to add new sources of linguistic knowledge, because many current systems must be completely reengineered for each new knowledge source. (4) Statistical models of word alignment are most often learned in an unsupervised training process, which is unable to take advantage of annotated data.

This thesis remedies these problems by making contributions in the following three areas: (1) We have found a new method for automatically measuring alignment quality using an unbalanced F-Measure metric. We have validated that this metric adequately measures alignment quality for the translation task. We have shown that the metric can be used to derive a loss function for discriminative training approaches, and that it is useful for measuring progress during the development of new word alignment procedures. (2) We have designed a new statistical model for word alignment called LEAF, which directly models the word alignment structure as it is used for machine translation, in contrast with previous models, which make unreasonable structural assumptions. (3) We have developed a semi-supervised training algorithm, the EMD algorithm, which automatically takes advantage of whatever quantity of manually annotated data can be obtained. The use of the EMD algorithm allows new knowledge sources to be introduced with minimal effort. We have shown that these contributions improve state-of-the-art statistical machine translation systems in experiments on challenging large data sets.
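As an illustration of what an unbalanced F-Measure over alignment links can look like, the following is a minimal sketch, not the thesis's own implementation. It assumes the usual weighted harmonic mean of precision and recall with a trade-off parameter `alpha`; the abstract does not state the formula or a recommended value, so the function name, the link representation, and the default `alpha` are all assumptions made for illustration.

```python
from typing import Set, Tuple

# Assumed representation: a link is (source word index, target word index).
Link = Tuple[int, int]


def weighted_f_measure(hypothesis: Set[Link],
                       reference: Set[Link],
                       alpha: float = 0.5) -> float:
    """Weighted harmonic mean of link precision and recall.

    alpha = 0.5 gives the balanced F-measure; alpha closer to 1 weights
    precision more heavily, alpha closer to 0 weights recall more heavily.
    The default value is an illustrative assumption.
    """
    if not hypothesis or not reference:
        return 0.0
    correct = len(hypothesis & reference)
    if correct == 0:
        return 0.0
    precision = correct / len(hypothesis)
    recall = correct / len(reference)
    return 1.0 / (alpha / precision + (1.0 - alpha) / recall)


# Toy example: 2 of 3 hypothesized links appear in a 4-link reference.
reference_links = {(0, 0), (1, 1), (2, 2), (3, 3)}
hypothesis_links = {(0, 0), (1, 1), (2, 3)}
balanced = weighted_f_measure(hypothesis_links, reference_links, alpha=0.5)
recall_heavy = weighted_f_measure(hypothesis_links, reference_links, alpha=0.2)
```

In this sketch, varying `alpha` changes how strongly missing links are penalized relative to spurious links, which is the sense in which the measure is "unbalanced".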

Daniel Marcu | Alexander Fraser
