Global Explainability of BERT-Based Evaluation Metrics by Disentangling along Linguistic Factors

Evaluation metrics are a key ingredient for progress of text generation systems. In recent years, several BERT-based evaluation metrics have been proposed (including BERTScore, MoverScore, BLEURT, etc.) which correlate much better with human assessment of text generation quality than BLEU or ROUGE, invented two decades ago. However, little is known what these metrics, which are based on black-box language model representations, actually capture (it is typically assumed they model semantic similarity). In this work, we use a simple regression based global explainability technique to disentangle metric scores along linguistic factors, including semantics, syntax, morphology, and lexical overlap. We show that the different metrics capture all aspects to some degree, but that they are all substantially sensitive to lexical overlap, just like BLEU and ROUGE. This exposes limitations of these novelly proposed metrics, which we also highlight in an adversarial test scenario.

[1]  Kilian Q. Weinberger,et al.  BERTScore: Evaluating Text Generation with BERT , 2019, ICLR.

[2]  Anna Rumshisky,et al.  A Primer in BERTology: What We Know About How BERT Works , 2020, Transactions of the Association for Computational Linguistics.

[3]  Yonatan Belinkov,et al.  Probing the Probing Paradigm: Does Probing Accuracy Entail Task Relevance? , 2020, EACL.

[4]  Philipp Koehn,et al.  Re-evaluating the Role of Bleu in Machine Translation Research , 2006, EACL.

[5]  Guillaume Lample,et al.  What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties , 2018, ACL.

[6]  Salim Roukos,et al.  Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.

[7]  Nikolaos Aletras,et al.  Translation Error Detection as Rationale Extraction , 2021, FINDINGS.

[8]  Wei Zhao,et al.  The Eval4NLP Shared Task on Explainable Quality Estimation: Overview and Results , 2021, EVAL4NLP.

[9]  Wei Zhao,et al.  SUPERT: Towards New Frontiers in Unsupervised Evaluation Metrics for Multi-Document Summarization , 2020, ACL.

[10]  Philipp Koehn,et al.  Findings of the 2015 Workshop on Statistical Machine Translation , 2015, WMT@EMNLP.

[11]  Lucia Specia,et al.  BERGAMOT-LATTE Submissions for the WMT20 Quality Estimation Shared Task , 2020, WMT.

[12]  Scott Lundberg,et al.  A Unified Approach to Interpreting Model Predictions , 2017, NIPS.

[13]  Kaizhong Zhang,et al.  Simple Fast Algorithms for the Editing Distance Between Trees and Related Problems , 1989, SIAM J. Comput..

[14]  Kun Qian,et al.  A Survey of the State of Explainable AI for Natural Language Processing , 2020, AACL/IJCNLP.

[15]  C.-C. Jay Kuo,et al.  SBERT-WK: A Sentence Embedding Method by Dissecting BERT-Based Word Models , 2020, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[16]  Holger Schwenk,et al.  Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond , 2018, Transactions of the Association for Computational Linguistics.

[17]  Iryna Gurevych,et al.  Text Processing Like Humans Do: Visually Attacking and Shielding NLP Systems , 2019, NAACL.

[18]  John Hewitt,et al.  Designing and Interpreting Probes with Control Tasks , 2019, EMNLP.

[19]  Naveen Arivazhagan,et al.  Language-agnostic BERT Sentence Embedding , 2020, ArXiv.

[20]  Atsushi Fujita,et al.  Scientific Credibility of Machine Translation Research: A Meta-Evaluation of 769 Papers , 2021, ACL.

[21]  Carlos Guestrin,et al.  "Why Should I Trust You?": Explaining the Predictions of Any Classifier , 2016, ArXiv.

[22]  Fei Liu,et al.  MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance , 2019, EMNLP.

[23]  Ray Kurzweil,et al.  Multilingual Universal Sentence Encoder for Semantic Retrieval , 2019, ACL.

[24]  Tomas Mikolov,et al.  Enriching Word Vectors with Subword Information , 2016, TACL.

[25]  Philip Bille,et al.  A survey on tree edit distance and related problems , 2005, Theor. Comput. Sci..

[26]  Philipp Koehn,et al.  Findings of the 2017 Conference on Machine Translation (WMT17) , 2017, WMT.

[27]  Sampo Pyysalo,et al.  Universal Dependencies v2: An Evergrowing Multilingual Treebank Collection , 2020, LREC.

[28]  Chin-Yew Lin,et al.  ROUGE: A Package for Automatic Evaluation of Summaries , 2004, ACL 2004.

[29]  Lucia Specia,et al.  SentSim: Crosslingual Semantic Evaluation of Machine Translation , 2021, NAACL.

[30]  Ray Kurzweil,et al.  Learning Cross-Lingual Sentence Representations via a Multi-task Dual-Encoder Model , 2019, RepL4NLP@ACL.

[31]  Thibault Sellam,et al.  BLEURT: Learning Robust Metrics for Text Generation , 2020, ACL.

[32]  Karin M. Verspoor,et al.  Findings of the 2016 Conference on Machine Translation , 2016, WMT.

[33]  Chi-kiu Lo,et al.  YiSi - a Unified Semantic MT Quality Evaluation and Estimation Metric for Languages with Different Levels of Available Resources , 2019, WMT.

[34]  Iryna Gurevych,et al.  Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks , 2019, EMNLP.

[35]  Eneko Agirre,et al.  SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation , 2017, *SEMEVAL.

[36]  Nitika Mathur,et al.  Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics , 2020, ACL.

[37]  Matt J. Kusner,et al.  From Word Embeddings To Document Distances , 2015, ICML.

[38]  Danqi Chen,et al.  A Fast and Accurate Dependency Parser using Neural Networks , 2014, EMNLP.

[39]  Iryna Gurevych,et al.  Making Monolingual Sentence Embeddings Multilingual Using Knowledge Distillation , 2020, EMNLP.

[40]  Xipeng Qiu,et al.  BERT-ATTACK: Adversarial Attack against BERT Using BERT , 2020, EMNLP.

[41]  Steffen Eger,et al.  BERT-Defense: A Probabilistic Model Based on BERT to Combat Cognitively Inspired Orthographic Adversarial Attacks , 2021, FINDINGS.

[42]  Markus Freitag,et al.  BLEU Might Be Guilty but References Are Not Innocent , 2020, EMNLP.

[43]  Yonatan Belinkov,et al.  Fine-grained Analysis of Sentence Embeddings Using Auxiliary Prediction Tasks , 2016, ICLR.

[44]  Wei Zhao,et al.  On the Limitations of Cross-lingual Encoders as Exposed by Reference-Free Machine Translation Evaluation , 2020, ACL.

[45]  Dan Klein,et al.  Multilingual Alignment of Contextual Word Representations , 2020, ICLR.

[46]  Jason Baldridge,et al.  PAWS: Paraphrase Adversaries from Word Scrambling , 2019, NAACL.

[47]  Dipanjan Das,et al.  BERT Rediscovers the Classical NLP Pipeline , 2019, ACL.

[48]  Iryna Gurevych,et al.  How to Probe Sentence Embeddings in Low-Resource Languages: On Structural Design Choices for Probing Task Evaluation , 2020, CONLL.

[49]  Markus Freitag,et al.  KoBE: Knowledge-Based Machine Translation Evaluation , 2020, FINDINGS.

[50]  Alon Lavie,et al.  COMET: A Neural Framework for MT Evaluation , 2020, EMNLP.