Domain adaptation of statistical machine translation with domain-focused web crawling

In this paper, we tackle the problem of domain adaptation of statistical machine translation (SMT) by exploiting domain-specific data acquired by domain-focused crawling of text from the World Wide Web. We design and empirically evaluate a procedure for automatic acquisition of monolingual and parallel text and its exploitation for system training, tuning, and testing in a phrase-based SMT framework. We present a strategy for using such resources depending on their availability and quantity, supported by the results of a large-scale evaluation carried out for two domains (environment and labour legislation) and two language pairs (English–French and English–Greek), in both translation directions (into and from English). In general, machine translation systems trained and tuned on a general domain perform poorly on specific domains. We show that such systems can be adapted successfully by retuning model parameters using small amounts of parallel in-domain data, and may be further improved by using additional monolingual and parallel training data for adaptation of language and translation models. The average observed improvement in BLEU is substantial, at 15.30 points absolute.
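
To illustrate why retuning model parameters alone can change system output, the following minimal sketch scores two candidate translations with a log-linear model, the standard scoring form in phrase-based SMT. It is not the procedure evaluated in the paper: the feature names (`tm`, `lm`), the feature values, and both weight sets are hypothetical, chosen only to show that reweighting the same features can select a different hypothesis. In practice, tuning searches for such weights automatically on held-out in-domain parallel data.

```python
from typing import Dict

# A hypothesis is represented by its feature values (log-probabilities).
# All names and numbers below are hypothetical, for illustration only.
Features = Dict[str, float]


def score(features: Features, weights: Dict[str, float]) -> float:
    """Log-linear model score: weighted sum of feature values."""
    return sum(weights[name] * value for name, value in features.items())


def best(candidates: Dict[str, Features], weights: Dict[str, float]) -> str:
    """Return the label of the highest-scoring candidate."""
    return max(candidates, key=lambda label: score(candidates[label], weights))


candidates = {
    # Favoured by the translation model, disfavoured by the language model.
    "general-sounding": {"tm": -2.0, "lm": -8.0},
    # Disfavoured by the translation model, favoured by the language model.
    "domain-sounding": {"tm": -5.0, "lm": -5.0},
}

# Weights as they might look after tuning on general-domain data (assumed).
general_weights = {"tm": 1.0, "lm": 0.3}
# Weights as they might look after retuning on in-domain data (assumed).
retuned_weights = {"tm": 0.3, "lm": 1.0}

choice_general = best(candidates, general_weights)
choice_retuned = best(candidates, retuned_weights)
```

With the assumed general-domain weights the first candidate wins; with the assumed retuned weights the second does, although neither the candidates nor their feature values have changed.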
