Domain Adaptation of Statistical Machine Translation using Web-Crawled Resources: A Case Study

We tackle the problem of domain adaptation of Statistical Machine Translation by exploiting domain-specific data acquired by domain-focused web-crawling. We design and evaluate a procedure for automatic acquisition of monolingual and parallel data and their exploitation for training, tuning, and testing in a phrase-based Statistical Machine Translation system. We present a strategy for using such resources depending on their availability and quantity, supported by results of a large-scale evaluation on two domains (Natural Environment and Labour Legislation) and two language pairs (English-French and English-Greek). The average observed increase in BLEU is substantial, at 49.5% relative.
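The abstract reports its gain as a relative rather than an absolute BLEU increase. The distinction can be illustrated with a minimal sketch; the function name and the scores below are hypothetical and are not taken from the paper.

```python
def relative_improvement(baseline: float, adapted: float) -> float:
    """Return the change from baseline to adapted as a percentage of baseline."""
    if baseline <= 0:
        raise ValueError("baseline score must be positive")
    return (adapted - baseline) / baseline * 100.0


# Hypothetical BLEU scores for illustration only (not results from the paper).
baseline_bleu = 20.0
adapted_bleu = 30.0

absolute_gain = adapted_bleu - baseline_bleu                       # 10.0 BLEU points
relative_gain = relative_improvement(baseline_bleu, adapted_bleu)  # 50.0 percent

print(f"absolute: +{absolute_gain:.1f} BLEU, relative: +{relative_gain:.1f}%")
```

Under these assumed numbers, a 10-point absolute gain over a baseline of 20 corresponds to a 50% relative gain, so a relative figure depends on how low the baseline was.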

Josef van Genabith | Antonio Toral | Pavel Pecina | Prokopis Prokopidis | Vassilis Papavassiliou
