Unsupervised Quality Estimation Model for English to German Translation and Its Application in Extensive Supervised Evaluation

With the rapid development of machine translation (MT), MT evaluation has become increasingly important for telling us in a timely manner whether an MT system is making progress. Conventional MT evaluation methods calculate the similarity between hypothesis translations produced by automatic translation systems and reference translations produced by professional translators. Existing evaluation metrics have several weaknesses. First, the factors they are built on are not comprehensive, which leads to a language-bias problem: they perform well on certain language pairs but poorly on others. Second, they tend to use either no linguistic features or too many; metrics without linguistic features draw criticism from linguists, while metrics with too many linguistic features are hard to reproduce. Third, the reference translations they depend on are expensive to produce and sometimes unavailable in practice. In this paper, the authors propose an unsupervised MT evaluation metric that uses a universal part-of-speech tagset and does not rely on reference translations. The authors also examine how the designed metric performs on traditional supervised evaluation tasks. Both the supervised and unsupervised experiments show that the proposed methods yield higher correlation with human judgments.
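The abstract does not spell out the metric's formula, so the following is only a minimal sketch of the general idea it describes: tag both the source sentence and the hypothesis translation, map the language-specific tags onto the shared universal part-of-speech tagset, and score the hypothesis by the overlap of the resulting POS n-grams, with no reference translation involved. The tag-mapping tables, the POS n-gram F1 combination, and the function names below are illustrative assumptions, not the authors' actual method.

```python
# Illustrative sketch only: reference-free scoring via universal POS n-gram overlap
# between an English source and a German hypothesis. The mapping tables here cover
# only a handful of tags and are assumptions for demonstration purposes.
from collections import Counter

# Hypothetical mappings from language-specific tags to the universal tagset.
UNIVERSAL_MAP_EN = {"NN": "NOUN", "NNP": "NOUN", "VBD": "VERB", "DT": "DET",
                    "JJ": "ADJ", "IN": "ADP", ".": ".", "CC": "CONJ"}
UNIVERSAL_MAP_DE = {"NN": "NOUN", "NE": "NOUN", "VVFIN": "VERB", "ART": "DET",
                    "ADJA": "ADJ", "APPR": "ADP", "$.": ".", "KON": "CONJ"}

def to_universal(tags, mapping):
    """Map language-specific POS tags to the universal tagset ('X' if unknown)."""
    return [mapping.get(t, "X") for t in tags]

def pos_ngrams(tags, n):
    """Collect POS n-grams of order n as a multiset."""
    return Counter(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))

def reference_free_score(src_tags, hyp_tags, max_n=3):
    """F1-style overlap of universal POS n-grams between source and hypothesis."""
    src_u = to_universal(src_tags, UNIVERSAL_MAP_EN)
    hyp_u = to_universal(hyp_tags, UNIVERSAL_MAP_DE)
    scores = []
    for n in range(1, max_n + 1):
        s, h = pos_ngrams(src_u, n), pos_ngrams(hyp_u, n)
        if not s or not h:
            continue
        overlap = sum((s & h).values())
        p, r = overlap / sum(h.values()), overlap / sum(s.values())
        scores.append(0.0 if p + r == 0 else 2 * p * r / (p + r))
    return sum(scores) / len(scores) if scores else 0.0

# Toy example: English source tags vs. German hypothesis tags.
print(reference_free_score(["DT", "NN", "VBD", "JJ", "."],
                           ["ART", "NN", "VVFIN", "ADJA", "$."]))
```

In the paper's setting, a score of this kind would then itself be evaluated by how well it correlates with human judgments of translation quality.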
