Paper information - Domain adaptation of statistical machine translation with domain-focused web crawling

In this paper, we tackle the problem of domain adaptation of statistical machine translation (SMT) by exploiting domain-specific data acquired by domain-focused crawling of text from the World Wide Web. We design and empirically evaluate a procedure for the automatic acquisition of monolingual and parallel text and its exploitation for system training, tuning, and testing in a phrase-based SMT framework. We present a strategy for using such resources depending on their availability and quantity, supported by the results of a large-scale evaluation carried out for two domains (environment and labour legislation) and two language pairs (English–French and English–Greek), in both directions: into and from English. In general, machine translation systems trained and tuned on a general domain perform poorly on specific domains. We show that such systems can be adapted successfully by retuning model parameters using small amounts of parallel in-domain data, and may be further improved by using additional monolingual and parallel training data for adaptation of language and translation models. The average observed improvement is substantial: 15.30 BLEU points absolute.
