3-Step Parallel Corpus Cleaning Using Monolingual Crowd Workers

Domains that lack sufficient existing resources need a manually created, high-quality parallel corpus to achieve good machine translation. Although having professional translators produce the corpus improves its quality to some extent, translation mistakes cannot be avoided completely. In this paper, we propose a framework for cleaning an existing professionally translated parallel corpus quickly and cheaply. The proposed method uses a 3-step crowdsourcing procedure to efficiently detect and edit translation flaws while also guaranteeing the reliability of the edits. Experiments on a fashion-domain e-commerce-site (EC-site) parallel corpus show the effectiveness of the proposed method for parallel corpus cleaning.
