Towards Using Web-Crawled Data for Domain Adaptation in Statistical Machine Translation

This paper reports on ongoing work on domain adaptation of statistical machine translation using domain-specific data obtained by domain-focused web crawling. We present a strategy for crawling monolingual and parallel data and for exploiting these data for testing, language modelling, and system tuning in a phrase-based machine translation framework. The proposed approach is evaluated on two domains, Natural Environment and Labour Legislation, and two language pairs, English-French and English-Greek.

Pavel Pecina | Andy Way | V. Papavassiliou | Prokopis Prokopidis | Antonio Toral | M. Giagkou

[1] Peter Fankhauser, et al. Boilerplate detection using shallow text features, 2010, WSDM '10.

[2] Ankit K. Srivastava, et al. MATREX: The DCU MT System for WMT 2009, 2009, WMT@EACL.

[3] Andreas Paepcke, et al. SpotSigs: robust and efficient near duplicate detection in large web collections, 2008, SIGIR '08.

[4] Eiichiro Sumita, et al. Dynamic Model Interpolation for Statistical Machine Translation, 2008, WMT@ACL.

[5] Preslav Nakov, et al. Improving English-Spanish Statistical Machine Translation: Experiments in Domain Adaptation, Sentence Paraphrasing, Tokenization, and Recasing, 2008, WMT@ACL.

[6] Josh Schroeder, et al. Experiments in Domain Adaptation for Statistical Machine Translation, 2007, WMT@ACL.

[7] Philipp Koehn, et al. Moses: Open Source Toolkit for Statistical Machine Translation, 2007, ACL.

[8] Dragos Stefan Munteanu, et al. Improving Machine Translation Performance by Exploiting Non-Parallel Corpora, 2005, CL.

[9] Hua Wu, et al. Alignment Model Adaptation for Domain-Specific Word Alignment, 2005, ACL.

[10] A. Lavie, et al. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments, 2005, IEEvaluation@ACL.

[11] Hua Wu, et al. Improving domain-specific word alignment with a general bilingual corpus, 2004, AMTA.

[12] Alexander H. Waibel, et al. Language Model Adaptation for Statistical Machine Translation Based on Information Retrieval, 2004, LREC.

[13] Franz Josef Och, et al. Minimum Error Rate Training in Statistical Machine Translation, 2003, ACL.

[14] Andreas Stolcke, et al. SRILM - an extensible language modeling toolkit, 2002, INTERSPEECH.

[15] Philippe Langlais, et al. Improving a general-purpose Statistical Translation Engine by Terminological lexicons, 2002, COLING 2002.

[16] Salim Roukos, et al. Bleu: a Method for Automatic Evaluation of Machine Translation, 2002, ACL.

[17] G. Doddington. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics, 2002.

[18] Andy Way, et al. Combining Multi-Domain Statistical Machine Translation Models using Automatic Classifiers, 2010, AMTA.

[19] Mikel L. Forcada, et al. Combining Content-Based and URL-Based Heuristics to Harvest Aligned Bitexts from Multilingual Sites with Bitextor, 2010, Prague Bull. Math. Linguistics.

[20] Philipp Koehn, et al. Europarl: A Parallel Corpus for Statistical Machine Translation, 2005, MTSUMMIT.

[21] Anders Ardö, et al. Focused crawling in the ALVIS semantic search engine, 2005.

[22] Alex Waibel, et al. Adaptation of the translation model for statistical machine translation based on information retrieval, 2005, EAMT.

[23] Silvia Bernardini, et al. BootCaT: Bootstrapping Corpora and Terms from the Web, 2004, LREC.