Towards Using Web-Crawled Data for Domain Adaptation in Statistical Machine Translation

This paper reports on ongoing work on domain adaptation of statistical machine translation using domain-specific data obtained by domain-focused web crawling. We present a strategy for crawling monolingual and parallel data and for exploiting these data for testing, language modelling, and system tuning in a phrase-based machine translation framework. The proposed approach is evaluated on two domains, Natural Environment and Labour Legislation, and two language pairs, English-French and English-Greek.
