SentSim: Crosslingual Semantic Evaluation of Machine Translation

Machine translation (MT) is currently evaluated in one of two ways: in a monolingual fashion, by comparing the system output to one or more human reference translations, or in a trained crosslingual fashion, by building a supervised model to predict quality scores from human-labeled data. In this paper, we propose a more cost-effective yet well-performing unsupervised alternative, SentSim: relying on strong pretrained multilingual word and sentence representations, we directly compare the source with the machine-translated sentence, thus avoiding the need for both reference translations and labeled training data. The metric builds on state-of-the-art embedding-based approaches, namely BERTScore and Word Mover's Distance, by incorporating a notion of sentence semantic similarity. By doing so, it achieves better correlation with human scores on different datasets. We show that it outperforms these and other metrics in the standard monolingual setting (MT vs. reference translation), as well as in the source-MT bilingual setting, where it performs on par with glass-box approaches to quality estimation that rely on MT model information.
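As a rough illustration of this idea, the sketch below pairs a token-level crosslingual signal (multilingual BERTScore computed between the source and the MT output) with a sentence-level one (cosine similarity of multilingual SBERT embeddings). The specific model names and the equal-weight averaging are assumptions made for illustration, not the paper's exact configuration.

```python
# A minimal SentSim-style sketch: score an MT output directly against its
# source sentence, with no reference translation. Assumes the bert-score
# and sentence-transformers packages; model choices and the 0.5/0.5
# weighting are illustrative assumptions.
from bert_score import BERTScorer
from sentence_transformers import SentenceTransformer, util

# Multilingual models make the source-vs-MT comparison possible.
scorer = BERTScorer(model_type="bert-base-multilingual-cased")
sbert = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def sentsim_style_score(sources, translations):
    """Return one quality score per (source, MT) pair."""
    # Token-level semantic overlap: crosslingual BERTScore F1.
    _, _, f1 = scorer.score(translations, sources)
    # Sentence-level semantic similarity via multilingual SBERT.
    src_emb = sbert.encode(sources, convert_to_tensor=True)
    mt_emb = sbert.encode(translations, convert_to_tensor=True)
    sent_sim = util.cos_sim(src_emb, mt_emb).diagonal()
    # Combine the two signals; equal weighting is an assumption here.
    return (0.5 * f1 + 0.5 * sent_sim.cpu()).tolist()

print(sentsim_style_score(
    ["Der Hund schläft auf dem Sofa."],
    ["The dog is sleeping on the sofa."],
))
```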
