Automatic Post-Editing for Vietnamese

Automatic post-editing (APE) is an important means of reducing errors in raw translations produced by machine translation (MT) systems or software-aided translation. In this paper, we present a systematic approach to the APE task for Vietnamese. Specifically, we construct the first large-scale dataset of 5M Vietnamese translated and corrected sentence pairs. We then train strong neural MT models on this dataset to perform APE. Results from both automatic and human evaluations show that the neural MT models are effective for the Vietnamese APE task.
