Statistical Post-Editing of Machine Translation for Domain Adaptation

This paper presents a statistical approach for adapting out-of-domain machine translation systems to the medical domain through an unsupervised post-editing step. The statistical post-editing model is built on statistical machine translation (SMT) outputs aligned with their reference translations. Evaluations on French-to-English translation of medical texts show that an out-of-domain machine translation system can be adapted a posteriori to a specific domain. Two SMT systems are studied: a state-of-the-art phrase-based implementation and a publicly available online system. Our experiments also indicate that selecting which sentences to post-edit leads to significant improvements in translation quality, and that further gains remain possible with respect to an oracle measure.
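
The idea of an oracle for sentence selection can be illustrated with a minimal sketch. The abstract does not define the oracle, so the code below assumes one plausible reading: for each sentence, keep whichever of the raw MT output or the post-edited output scores better against the reference. The unigram F1 score is a toy stand-in for a real metric such as BLEU or TER, and the function names and example sentences are hypothetical, invented only for illustration.

```python
from collections import Counter


def overlap_f1(hypothesis: str, reference: str) -> float:
    """Toy sentence-level score: unigram F1 against the reference.

    Assumption: a stand-in for a real metric (BLEU, TER); not from the paper.
    """
    hyp = Counter(hypothesis.lower().split())
    ref = Counter(reference.lower().split())
    matches = sum((hyp & ref).values())  # clipped unigram matches
    if matches == 0:
        return 0.0
    precision = matches / sum(hyp.values())
    recall = matches / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def oracle_select(mt_outputs, post_edited, references):
    """Per sentence, keep the candidate that scores higher against the reference.

    Ties keep the raw MT output. This needs references, so it is an upper
    bound on what a real sentence selector could achieve, not a usable system.
    """
    selected = []
    for mt, pe, ref in zip(mt_outputs, post_edited, references):
        use_pe = overlap_f1(pe, ref) > overlap_f1(mt, ref)
        selected.append(pe if use_pe else mt)
    return selected


# Invented example sentences (not data from the paper).
mt = ["the patient has a crisis cardiac", "take two tablets daily"]
pe = ["the patient has a heart attack", "take two tablet every day"]
ref = ["the patient had a heart attack", "take two tablets daily"]

best = oracle_select(mt, pe, ref)
```

In this toy case post-editing helps the first sentence and hurts the second, so the oracle keeps the post-edited version of the first and the raw output of the second. A practical selector would have to predict that choice without access to the reference.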
