Paper information - Integrating Source-Language Context into Log-Linear Models of Statistical Machine Translation

The translation features typically used in state-of-the-art statistical machine translation (SMT) model dependencies between the source and target phrases, but not among the phrases in the source language themselves. A swathe of research has demonstrated that integrating source context modelling directly into log-linear phrase-based SMT (PB-SMT) and hierarchical PB-SMT (HPB-SMT) can positively influence the weighting and selection of target phrases, and thus improve translation quality. In this thesis we present novel approaches to incorporating source-language contextual modelling into state-of-the-art SMT models in order to enhance the quality of lexical selection. We investigate the effectiveness of a range of contextual features, including lexical features of neighbouring words, part-of-speech tags, supertags, sentence-similarity features, dependency information, and semantic roles. We explore a series of language pairs featuring typologically different languages, and examine the scalability of our research to larger amounts of training data. While our results are mixed across feature selections, language pairs, and learning curves, we observe that including contextual features of the source sentence in general produces improvements. The most significant improvements involve the integration of long-distance contextual features, such as dependency relations in combination with part-of-speech tags in Dutch-to-English subtitle translation, the combination of dependency parse and semantic role information in English-to-Dutch parliamentary debate translation, supertag features in English-to-Chinese translation, or the combination of supertag and lexical features in English-to-Dutch subtitle translation. Furthermore, we investigate the applicability of our lexical contextual model to another closely related NLP problem, namely machine transliteration.

Rejwanul Haque
