Machine Translation Evaluation: A Survey

This paper surveys state-of-the-art machine translation (MT) evaluation, covering both manual and automatic methods. The traditional human evaluation criteria mainly include intelligibility, fidelity, fluency, adequacy, comprehension, and informativeness. The advanced human assessments include task-oriented measures, post-editing, segment ranking, and extended criteria. We classify the automatic evaluation methods into two categories: those based on lexical similarity and those that apply linguistic features. The lexical similarity methods cover edit distance, precision, recall, F-measure, and word order. The linguistic features are divided into syntactic and semantic features. The syntactic features include part-of-speech tags, phrase types, and sentence structures; the semantic features include named entities, synonyms, textual entailment, paraphrase, semantic roles, and language models. We then introduce methods for evaluating the MT evaluation measures themselves, including different correlation scores, and the recent quality estimation (QE) tasks for MT. This paper differs from existing works \cite{GALEprogram2009,EuroMatrixProject2007} in several respects: it introduces recent developments in MT evaluation measures, classifies measures differently from manual to automatic evaluation, introduces the recent QE tasks for MT, and presents the content concisely.
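As a rough illustration of the lexical similarity measures named above, the following sketch computes word-level precision, recall, F-measure, and edit distance between an MT output and a single reference. It is a minimal example under stated assumptions, not the formulation used by any particular metric in the survey: tokenization is plain whitespace splitting, matches are clipped unigram counts, the F-measure is the balanced harmonic mean, and the function names `unigram_prf` and `word_edit_distance` are hypothetical.

```python
from collections import Counter


def unigram_prf(hypothesis, reference):
    """Clipped unigram precision, recall, and balanced F-measure.

    Assumption: whitespace tokenization and a single reference.
    """
    hyp = hypothesis.split()
    ref = reference.split()
    # Counter intersection keeps the minimum count per word (clipped matches).
    overlap = sum((Counter(hyp) & Counter(ref)).values())
    precision = overlap / len(hyp) if hyp else 0.0
    recall = overlap / len(ref) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


def word_edit_distance(hypothesis, reference):
    """Word-level Levenshtein distance (insertions, deletions, substitutions)."""
    hyp = hypothesis.split()
    ref = reference.split()
    # prev[j] holds the distance between the hypothesis prefix processed
    # so far and the first j reference words.
    prev = list(range(len(ref) + 1))
    for i, h_word in enumerate(hyp, start=1):
        curr = [i]
        for j, r_word in enumerate(ref, start=1):
            cost = 0 if h_word == r_word else 1
            curr.append(min(prev[j] + 1,          # delete a hypothesis word
                            curr[j - 1] + 1,      # insert a reference word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    hyp = "the cat sat on mat"
    ref = "the cat sat on the mat"
    print(unigram_prf(hyp, ref))
    print(word_edit_distance(hyp, ref))
```

Dividing the edit distance by the reference length gives a word-error-rate style score; that normalization is left out here to keep the sketch small.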
