Linguistic measures for automatic machine translation evaluation

Assessing the quality of candidate translations involves diverse linguistic facets. However, most automatic evaluation methods in use today rely on limited quality assumptions, such as lexical similarity. This introduces a bias into the development cycle that, in some cases, has been reported to have serious negative consequences. To address this methodological problem, we explore a novel path towards heterogeneous automatic Machine Translation evaluation. We have compiled a rich set of specialized similarity measures operating at different linguistic dimensions and analyzed their individual and collective behaviour over a wide range of evaluation scenarios. Results show that measures based on syntactic and semantic information provide more reliable system rankings than lexical measures, especially when the systems under evaluation are based on different paradigms. At the sentence level, some linguistic measures perform better than most lexical measures, while others perform substantially worse, mainly due to parsing problems. Their scores are, however, suitable for combination, as sketched below, yielding substantially improved evaluation quality.
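As a concrete illustration of the combination step, the following is a minimal sketch that merges per-sentence scores from several heterogeneous measures into a single quality estimate. It assumes a uniform linear combination over min-max-normalized scores; the metric names, the normalization scheme, and the example values are illustrative assumptions, not the paper's exact procedure.

```python
# Minimal sketch: uniform linear combination of normalized scores from
# heterogeneous MT evaluation measures. Metric names and the min-max
# normalization are illustrative assumptions.

from typing import Dict, List

def min_max_normalize(scores: List[float]) -> List[float]:
    """Rescale one metric's scores to [0, 1] so that metrics defined on
    different scales can be averaged together."""
    lo, hi = min(scores), max(scores)
    if hi == lo:                      # constant metric: contributes nothing
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def combine(metric_scores: Dict[str, List[float]]) -> List[float]:
    """Uniform linear combination: average the normalized scores that
    each metric assigns to each candidate translation."""
    normalized = [min_max_normalize(v) for v in metric_scores.values()]
    n_items = len(next(iter(normalized)))
    return [sum(m[i] for m in normalized) / len(normalized)
            for i in range(n_items)]

# Hypothetical per-sentence scores from a lexical, a syntactic, and a
# semantic measure, for three candidate translations of the same source.
scores = {
    "lexical_ngram":  [0.41, 0.55, 0.38],
    "syntactic_dep":  [0.62, 0.58, 0.44],
    "semantic_roles": [0.50, 0.71, 0.35],
}
print(combine(scores))  # combined quality estimates, higher is better
```

Under this scheme a measure that fails on a given sentence (e.g., because of a parsing problem) is simply averaged out by the remaining measures, which is one plausible reading of why the combination is more robust than any individual score.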
