BERTScore: Evaluating Text Generation with BERT

We propose BERTScore, an automatic evaluation metric for text generation. Like common metrics, BERTScore computes a similarity score between each token in the candidate sentence and each token in the reference sentence. However, instead of exact matches, it computes token similarity using contextual embeddings. We evaluate BERTScore on the outputs of 363 machine translation and image captioning systems. It correlates better with human judgments and provides stronger model selection performance than existing metrics. Finally, we use an adversarial paraphrase detection task to show that BERTScore is more robust to challenging examples than existing metrics.
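
The token-level comparison described above can be illustrated with a minimal sketch. It makes several assumptions: the hand-written 2-dimensional vectors stand in for contextual embeddings (a real setup would take them from a model such as BERT), similarity is cosine similarity, and each token is greedily matched to its most similar counterpart before averaging into precision, recall, and F1. The abstract does not specify the aggregation step, and the function names `cosine` and `greedy_match_score` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_match_score(cand_emb, ref_emb):
    """Return (precision, recall, f1) from pairwise token similarities.

    Assumption: each token is matched to its single most similar token
    on the other side (greedy matching), and matches are averaged.
    """
    # sim[i][j] = similarity of candidate token i and reference token j
    sim = [[cosine(c, r) for r in ref_emb] for c in cand_emb]

    # Precision: average best match for each candidate token.
    precision = sum(max(row) for row in sim) / len(cand_emb)

    # Recall: average best match for each reference token.
    recall = sum(
        max(sim[i][j] for i in range(len(cand_emb)))
        for j in range(len(ref_emb))
    ) / len(ref_emb)

    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy stand-ins for contextual embeddings (NOT real BERT outputs).
reference = [[1.0, 0.0], [0.0, 1.0]]
candidate = [[1.0, 0.0], [1.0, 1.0]]

p, r, f = greedy_match_score(candidate, reference)
print(f"precision={p:.4f} recall={r:.4f} f1={f:.4f}")
```

Because the similarity is soft rather than an exact-match indicator, the second candidate token still earns partial credit against both reference tokens, which an exact-match metric would score as zero.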
