Domain Adaptation of Statistical Machine Translation using Web-Crawled Resources: A Case Study

We tackle the problem of domain adaptation of Statistical Machine Translation by exploiting domain-specific data acquired by domain-focused web-crawling. We design and evaluate a procedure for the automatic acquisition of monolingual and parallel data and their exploitation for training, tuning, and testing in a phrase-based Statistical Machine Translation system. We present a strategy for using such resources depending on their availability and quantity, supported by the results of a large-scale evaluation on two domains (Natural Environment and Labour Legislation) and two language pairs (English-French and English-Greek). The average observed increase in BLEU is substantial, at 49.5% relative.
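To make the "relative" figure concrete, the following minimal sketch shows how a relative BLEU increase is conventionally computed from a baseline score and an adapted-system score. The function name and the example scores are hypothetical and chosen only for illustration; they are not results from the evaluation, and the exact averaging used for the reported figure is not specified here.

```python
def relative_increase(baseline_bleu: float, adapted_bleu: float) -> float:
    """Return the relative change from baseline to adapted score, in percent."""
    if baseline_bleu <= 0:
        raise ValueError("baseline BLEU must be positive")
    return (adapted_bleu - baseline_bleu) / baseline_bleu * 100.0


# Hypothetical scores, for illustration only (not taken from the paper).
baseline = 20.0
adapted = 29.9
print(round(relative_increase(baseline, adapted), 1))
```

A relative figure expresses the gain as a fraction of the baseline, so the same absolute gain in BLEU points yields a larger relative increase when the baseline is low.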
