Automatic Evaluation of Translation Quality for Distant Language Pairs

Automatic evaluation of Machine Translation (MT) quality is essential for developing high-quality MT systems. Various evaluation metrics have been proposed, and BLEU is now used as the de facto standard. However, for translation between distant language pairs such as Japanese and English, the most popular metrics (e.g., BLEU, NIST, PER, and TER) do not work well. Japanese and English have completely different word orders, so word order requires special care in translation: translations with the wrong word order often cause misunderstanding or are simply incomprehensible. For instance, SMT-based Japanese-to-English translators tend to render 'A because B' as 'B because A.' Word order is thus the most important problem in distant language translation, yet conventional evaluation metrics do not significantly penalize such word order mistakes, and locally optimizing these metrics therefore leads to inadequate translations. In this paper, we propose an automatic evaluation metric based on rank correlation coefficients modified with precision. Our meta-evaluation on the NTCIR-7 PATMT JE task data shows that this metric outperforms conventional metrics.
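
To make the idea concrete, the following is a minimal sketch of a rank-correlation-based metric of this kind, assuming a greedy one-to-one alignment of identical tokens, Kendall's tau normalized to [0, 1], and unigram precision raised to an illustrative exponent alpha. The alignment heuristic and the exponent are assumptions made for illustration only, not the exact procedure or parameters of the proposed metric.

```python
# Minimal sketch of a rank-correlation-based MT metric (not the authors'
# exact implementation). Assumptions: hypothesis words are aligned one-to-one
# to reference words by a greedy exact-token match, Kendall's tau is computed
# over the resulting rank sequence and normalized to [0, 1], and the score is
# damped by unigram precision raised to an illustrative exponent alpha.

from typing import List


def kendall_tau(ranks: List[int]) -> float:
    """Kendall's tau of a rank sequence against the identity order (no ties)."""
    n = len(ranks)
    if n < 2:
        return 0.0
    concordant = sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if ranks[i] < ranks[j]
    )
    pairs = n * (n - 1) / 2
    return 2.0 * concordant / pairs - 1.0  # value in [-1, 1]


def rank_correlation_score(hypothesis: List[str],
                           reference: List[str],
                           alpha: float = 0.25) -> float:
    """Normalized Kendall's tau of aligned word positions x unigram precision^alpha."""
    # Greedy one-to-one alignment: each hypothesis word takes the first
    # unused occurrence of the same token in the reference.
    used = set()
    ranks = []
    for word in hypothesis:
        for pos, ref_word in enumerate(reference):
            if pos not in used and ref_word == word:
                used.add(pos)
                ranks.append(pos)
                break
    if len(ranks) < 2:
        return 0.0
    nkt = (kendall_tau(ranks) + 1.0) / 2.0    # normalize tau to [0, 1]
    precision = len(ranks) / len(hypothesis)  # unigram precision of the hypothesis
    return nkt * (precision ** alpha)


if __name__ == "__main__":
    ref = "he did not go to school because he was sick".split()
    good = "he did not go to school because he was sick".split()
    bad = "he was sick because he did not go to school".split()
    print(rank_correlation_score(good, ref))  # 1.0: correct word order
    print(rank_correlation_score(bad, ref))   # much lower: 'A because B' flipped
```

On the 'A because B' versus 'B because A' example from the abstract, the correctly ordered hypothesis scores 1.0 while the reordered one scores far lower, even though both have perfect unigram precision, which is the kind of word order penalty that BLEU-style n-gram metrics largely miss.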

[1]  R. Forthofer, et al.  Rank Correlation Methods , 1981 .

[2]  Florence Reeder, et al.  Corpus-based comprehensive and diagnostic MT evaluation: initial Arabic, Chinese, French, and Spanish results , 2002 .

[3]  Salim Roukos, et al.  Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.

[4]  Eduard H. Hovy, et al.  Automatic Evaluation of Summaries Using N-gram Co-occurrence Statistics , 2003, NAACL.

[5]  I. Dan Melamed, et al.  Precision and Recall of Machine Translation , 2003, NAACL.

[6]  Yves Lepage, et al.  BLEU in Characters: Towards Automatic MT Evaluation in Languages without Word Delimiters , 2004, IJCNLP.

[7]  Alon Lavie, et al.  METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments , 2005, IEEvaluation@ACL.

[8]  Ralph Weischedel, et al.  A Study of Translation Error Rate with Targeted Human Annotation , 2005 .

[9]  Philipp Koehn, et al.  Re-evaluating the Role of Bleu in Machine Translation Research , 2006, EACL.

[10]  Matthew G. Snover, et al.  A Study of Translation Edit Rate with Targeted Human Annotation , 2006, AMTA.

[11]  Hiroshi Echizen-ya, et al.  Automatic evaluation of machine translation based on recursive acquisition of an intuitive common parts continuum , 2007, MTSUMMIT.

[12]  Philipp Koehn, et al.  (Meta-) Evaluation of Machine Translation , 2007, WMT@ACL.

[13]  Masao Utiyama, et al.  Overview of the Patent Translation Task at the NTCIR-7 Workshop , 2008, NTCIR.

[14]  Alexandra Birch, et al.  Metrics for MT evaluation: evaluating reordering , 2010, Machine Translation.

[15]  Alexandra Birch, et al.  LRscore for Evaluating Lexical and Reordering Quality in MT , 2010, WMT@ACL.

[16]  Kevin Duh, et al.  Head Finalization: A Simple Reordering Rule for SOV Languages , 2010, WMT@ACL.

[17]  Manisha Sharma, et al.  Evaluation of machine translation , 2011, ICWET.