Topic Model Based Adaptation Data Selection for Domain-Specific Machine Translation

Current domain-specific machine translation (MT) suffers from a lack of high-quality bilingual corpora. Existing work in this field has shown the advantage of adaptation data selection (Ada-selection) for enriching such corpora. Encouraged by the empirical finding that topic distribution is conducive to characterizing a distinctive domain, we propose to use a topic model to improve Ada-selection. Based on a joint LDA approach, we incorporate topic distribution into measuring the relevance between the target domain and candidate parallel sentence pairs. On this basis, we select the highly relevant candidates as high-quality domain-specific bilingual corpora. In practice, we apply our method to acquire domain-specific corpora from general-domain corpora. Experiments on an end-to-end domain-specific MT task show that our method outperforms the state of the art, yielding an improvement of at least 1.5 BLEU points at different scales of training data.
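
The selection step described above can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes topic distributions have already been inferred for the target domain and for each candidate sentence pair (for example by an LDA model), and it uses cosine similarity as a stand-in relevance measure, since the abstract does not specify the exact formula. The function names and the toy data are hypothetical.

```python
import math


def cosine(p, q):
    """Cosine similarity between two topic distributions of equal length."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def select_by_topic_relevance(domain_topics, candidates, top_k):
    """Rank candidate sentence pairs by topic relevance and keep the top_k.

    domain_topics: topic distribution of the target domain (assumed given).
    candidates: list of (source, target, topic_distribution) tuples
                (topic distributions assumed precomputed).
    Returns a list of (score, source, target), most relevant first.
    """
    scored = [
        (cosine(domain_topics, topics), src, tgt)
        for src, tgt, topics in candidates
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]


# Toy data: three topics, three general-domain candidate pairs (hypothetical).
domain_topics = [0.7, 0.2, 0.1]
candidates = [
    ("src_a", "tgt_a", [0.8, 0.1, 0.1]),
    ("src_b", "tgt_b", [0.1, 0.1, 0.8]),
    ("src_c", "tgt_c", [0.5, 0.4, 0.1]),
]

selected = select_by_topic_relevance(domain_topics, candidates, top_k=2)
for score, src, tgt in selected:
    print(f"{score:.3f}\t{src}\t{tgt}")
```

In this sketch the selected pairs would then be added to the training data of the domain-specific MT system; other relevance measures over topic distributions could be substituted for `cosine` without changing the selection logic.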
