STD: An Automatic Evaluation Metric for Machine Translation Based on Word Embeddings

Lexical metrics such as BLEU, NIST, and WER have been widely used in machine translation (MT) evaluation. However, these metrics capture semantic relationships poorly and require exact word matches, which leads to only moderate correlation with human judgments. In this paper, we propose Semantic Travel Distance (STD), a novel automatic MT evaluation metric based on word embeddings. STD combines semantic and lexical features (word embeddings, n-grams, and word order) in a single metric. It measures the semantic distance between the hypothesis and the reference by calculating the minimum cumulative cost that the embedded n-grams of the hypothesis need to “travel” to reach the embedded n-grams of the reference. Experimental results show that STD performs better and more robustly than a range of state-of-the-art metrics in both segment-level and system-level evaluation.
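The "travel" cost described above can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not specify the n-gram composition, the mass assigned to each n-gram, the ground distance, or how word order enters the score. The sketch below assumes averaged word vectors, uniform mass, and Euclidean distance, and it leaves out any separate word-order term. The toy vectors and the names `embed_ngrams`, `travel_distance`, and `std_sketch` are hypothetical.

```python
import math
from collections import deque

# Toy 2-d word vectors, invented for this illustration (not trained embeddings).
TOY_VECTORS = {
    "the": (0.0, 0.0), "a": (0.0, 0.1),
    "cat": (1.0, 0.0), "kitten": (1.0, 0.2),
    "sleeps": (0.0, 1.0), "naps": (0.1, 1.0),
    "stock": (5.0, 5.0), "fell": (6.0, 5.0),
}


def embed_ngrams(tokens, n, vectors=TOY_VECTORS):
    """Embed every n-gram as the mean of its word vectors (assumed composition)."""
    grams = [tokens[i:i + n] for i in range(len(tokens) - n + 1)]
    dim = len(next(iter(vectors.values())))
    return [tuple(sum(vectors[w][d] for w in gram) / n for d in range(dim))
            for gram in grams]


def travel_distance(hyp_vecs, ref_vecs):
    """Minimum cost of moving uniform mass from hyp_vecs onto ref_vecs."""
    n, m = len(hyp_vecs), len(ref_vecs)
    if n == 0 or m == 0:
        raise ValueError("both sides need at least one n-gram")
    source, sink = 0, n + m + 1
    graph = [[] for _ in range(n + m + 2)]
    to, cap, cost = [], [], []

    def add_edge(u, v, capacity, weight):
        # Forward edge at an even index; its residual twin sits at index ^ 1.
        for a, b, c, w in ((u, v, capacity, weight), (v, u, 0, -weight)):
            graph[a].append(len(to))
            to.append(b)
            cap.append(c)
            cost.append(w)

    # Masses 1/n and 1/m are scaled by n*m so that all capacities are integers.
    for i in range(n):
        add_edge(source, 1 + i, m, 0.0)
        for j in range(m):
            add_edge(1 + i, 1 + n + j, m, math.dist(hyp_vecs[i], ref_vecs[j]))
    for j in range(m):
        add_edge(1 + n + j, sink, n, 0.0)

    flow, total = 0, 0.0
    while flow < n * m:
        # Queue-based Bellman-Ford search for the cheapest augmenting path.
        dist = [math.inf] * len(graph)
        prev_edge = [-1] * len(graph)
        queued = [False] * len(graph)
        dist[source] = 0.0
        queue = deque([source])
        while queue:
            u = queue.popleft()
            queued[u] = False
            for e in graph[u]:
                v = to[e]
                if cap[e] > 0 and dist[u] + cost[e] < dist[v] - 1e-12:
                    dist[v] = dist[u] + cost[e]
                    prev_edge[v] = e
                    if not queued[v]:
                        queue.append(v)
                        queued[v] = True
        # Find the bottleneck along the path, then push that much flow.
        push, v = n * m - flow, sink
        while v != source:
            push = min(push, cap[prev_edge[v]])
            v = to[prev_edge[v] ^ 1]
        v = sink
        while v != source:
            cap[prev_edge[v]] -= push
            cap[prev_edge[v] ^ 1] += push
            v = to[prev_edge[v] ^ 1]
        flow += push
        total += push * dist[sink]
    return total / (n * m)  # undo the integer scaling


def std_sketch(hyp_tokens, ref_tokens, n=2):
    """Travel distance between the embedded n-grams of two token lists."""
    return travel_distance(embed_ngrams(hyp_tokens, n), embed_ngrams(ref_tokens, n))


if __name__ == "__main__":
    reference = "the cat sleeps".split()
    print(std_sketch("a kitten naps".split(), reference))
    print(std_sketch("the stock fell".split(), reference))
```

Under these assumptions, a paraphrase with no exact word overlap ("a kitten naps") still ends up close to the reference, while an unrelated hypothesis ends up far from it. Using n-grams rather than single words makes the embedded units sensitive to local word order, which is one plausible reading of how lexical and order information could enter such a metric.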
