Domain Adaptation via Pseudo In-Domain Data Selection

We explore efficient domain adaptation for the task of statistical machine translation based on extracting sentences from a large general-domain parallel corpus that are most relevant to the target domain. These sentences may be selected with simple cross-entropy based methods, of which we present three. As these sentences are not themselves identical to the in-domain data, we call them pseudo in-domain subcorpora. These subcorpora -- 1% the size of the original -- can then be used to train small domain-adapted Statistical Machine Translation (SMT) systems which outperform systems trained on the entire corpus. Performance is further improved when we use these domain-adapted models in combination with a true in-domain model. The results show that more training data is not always better, and that the best results are attained via proper domain-relevant data selection combined with the use of in- and general-domain systems during decoding.
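The core selection idea -- ranking general-domain sentences by how much more probable they are under an in-domain language model than under a general-domain one (a Moore-Lewis-style cross-entropy difference) -- can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes simple unigram models with add-one smoothing, and the function names (`unigram_lm`, `select_pseudo_in_domain`) are invented for this example.

```python
import math
from collections import Counter

def unigram_lm(sentences):
    """Build an add-one-smoothed unigram model; returns a log-probability function."""
    counts = Counter(tok for s in sentences for tok in s.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # reserve one slot for unseen tokens
    def logprob(tok):
        return math.log((counts.get(tok, 0) + 1) / (total + vocab))
    return logprob

def cross_entropy(sentence, logprob):
    """Per-token cross-entropy of a sentence under the given model."""
    toks = sentence.split()
    return -sum(logprob(t) for t in toks) / max(len(toks), 1)

def select_pseudo_in_domain(general, in_domain, general_sample, fraction=0.01):
    """Rank general-domain sentences by H_in(s) - H_gen(s) and keep the
    lowest-scoring fraction: sentences that look in-domain but not merely easy."""
    lm_in = unigram_lm(in_domain)
    lm_gen = unigram_lm(general_sample)
    scored = sorted(
        general,
        key=lambda s: cross_entropy(s, lm_in) - cross_entropy(s, lm_gen),
    )
    k = max(1, int(len(general) * fraction))
    return scored[:k]
```

Subtracting the general-domain cross-entropy matters: ranking by in-domain cross-entropy alone favors short, frequent-word sentences, whereas the difference favors sentences that are specifically in-domain-like.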
